Introduction to deep learning (morning)
Deep learning for object recognition (morning)
Deep learning for object segmentation (afternoon)
Deep learning for object detection (afternoon)
Deep learning for object tracking (afternoon)
Open questions and future works (afternoon)


Introduction to Deep Learning 


• Historical review of deep learning
• Introduction to classical deep models
• Why does deep learning work?


Machine Learning 
• Solve general learning problems
• Tied with biological system


But it was given up...

• Hard to train
• Insufficient computational resources
• Small training sets
• Does not work well


[Timeline: neural network and back propagation (Nature, 1986) to 2006]

• SVM
• Boosting
• Decision tree
• KNN
• Flat structures
• Loose tie with biological systems
• Specific methods for specific tasks
  - Hand crafted features (GMM-HMM, SIFT, LBP, HOG)


[Figure: deep hierarchy versus flat processing scheme (Kruger et al., TPAMI'13)]
Deep belief net

• Unsupervised and layer-wise pre-training
• Better designs for modeling and training (normalization, nonlinearity, dropout)
• New development of computer architectures
  - GPU
  - Multi-core computer systems
• Large scale databases

Big Data!


Machine Learning with Big Data 


Machine learning with small data: overfitting, reducing model complexity (capacity), adding regularization

Machine learning with big data: underfitting, increasing model complexity, optimization, computation resources

[Figure: prediction accuracy versus size of training data, for deep learning and for other machine learning tools]


How to increase model capacity ? 


Curse of dimensionality
↓
Blessing of dimensionality
↓
Learning hierarchical feature transforms (learning features with deep structures)

D. Chen, X. Cao, F. Wen, and J. Sun. Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification. In Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, 2013.
Deep learning results

Task                                     | Hours of training data | DNN-HMM     | GMM-HMM with same data
Switchboard (test set 1)                 | 309                    | [illegible] | 27.4
Switchboard (test set 2)                 | [illegible]            | [illegible] | 23.6
English Broadcast News                   | [illegible]            | [illegible] | 18.8
Bing Voice Search (sentence error rates) | [illegible]            | 30.4        | 36.2
Google Voice Input                       | [illegible]            | [illegible] | [illegible]
Youtube                                  | [illegible]            | [illegible] | 52.3

Deep Networks Advance State of Art in Speech
Deep Learning leads to breakthrough in speech recognition at MSR (Microsoft).
Rank | Name        | Error rate | Description
1    | U. Toronto  | 0.15315    | Deep learning
2    | U. Tokyo    | 0.26172    | Hand-crafted features and learning models. Bottleneck.
3    | U. Oxford   | 0.26979    | Hand-crafted features and learning models. Bottleneck.
4    | Xerox/INRIA | 0.27058    | Hand-crafted features and learning models. Bottleneck.

Object recognition over 1,000,000 images and 1,000 categories (2 GPU)

A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” NIPS, 2012.

Examples from ImageNet

Images courtesy of ImageNet (http://www.image-net.org/challenges/LSVRC/2010/index)
• ImageNet 2013: image classification challenge

Rank | Name        | Error rate | Description
1    | [illegible] | 0.11197    | Deep learning
2    | NUS         | 0.12535    | Deep learning
3    | Oxford      | 0.13555    | Deep learning

MSRA, IBM, Adobe, NEC, Clarifai, Berkeley, U. Tokyo, UCLA, UIUC, Toronto .... Top 20 groups all used deep learning

• ImageNet 2013: object detection challenge

Rank | Name         | Mean Average Precision | Description
1    | UvA-Euvision | 0.22581                | Hand-crafted features
2    | NEC-MU       | 0.20895                | Hand-crafted features
3    | NYU          | 0.19400                | Deep learning
• ImageNet 2014: image classification challenge

Rank | Name   | Error rate | Description
1    | Google | 0.06656    | Deep learning
2    | Oxford | 0.07325    | Deep learning
3    | MSRA   | 0.08062    | Deep learning

• ImageNet 2014: object detection challenge

Rank | Name            | Mean Average Precision | Description
1    | Google          | 0.43933                | Deep learning
2    | CUHK            | 0.40656                | Deep learning
3    | DeepInsight     | 0.40452                | Deep learning
4    | UvA-Euvision    | 0.35421                | Deep learning
5    | Berkeley Vision | 0.34521                | Deep learning
• ImageNet 2014: object detection challenge (mean average precision, %)

              | RCNN (Berkeley) | Berkeley vision | UvA-Euvision | DeepInsight | GoogLeNet (Google) | DeepID-Net (CUHK)
Model average | n/a             | n/a             | n/a          | 40.5        | 43.9               | 50.3
Single model  | 31.4            | 34.5            | 35.4         | 40.2        | 38.0               | 47.9

Wanli Ouyang

W. Ouyang and X. Wang et al. “DeepID-Net: deformable deep convolutional neural networks for object detection”, CVPR, 2015
• Google and Baidu announced their deep learning based visual search engines (2013)
  - Google
    • “On our test set we saw double the average precision when compared to other approaches we had tried. We acquired the rights to the technology and went full speed ahead adapting it to run at large scale on Google’s computers. We took cutting edge research straight out of an academic research lab and launched it, in just a little over six months.”
  - Baidu

The New York Times

[Timeline figure: neural network and back propagation (1986), deep belief net (2006), followed by milestones in 2011, 2012 and 2014 labelled Microsoft, Google and face recognition]


¢ Deep learning achieves 99.47% face verification 
accuracy on Labeled Faces in the Wild (LFW), 
higher than human performance 


Y. Sun, X. Wang, and X. Tang. Deep Learning Face Representation by Joint 
Identification-Verification. NIPS, 2014. 


Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are 
Sparse, selective, and robust. CVPR, 2015. 


Labeled Faces in the Wild (2007) 
[Table: LFW results under the "Unrestricted, Labeled Outside Data" protocol, reporting the mean classification accuracy and the standard error of the mean for each method; the method names did not survive extraction]











10 Breakthrough Technologies 2013

Deep Learning: With massive amounts of computational power, machines can now recognize objects and translate speech in real time. Artificial intelligence is finally getting smart.

Other technologies on the same list: Memory Implants, Temporary Social Media, Smart Watches, Prenatal DNA Sequencing, Ultra-Efficient Solar Power, Additive Manufacturing, Big Data from Cheap Phones, Baxter: The Blue-Collar Robot, Supergrids.

Design Cycle

[Flowchart from start to end; recoverable stages: preprocessing and feature design, choose and design model]

• Domain knowledge: interest of people working on computer vision, speech recognition, medical image processing, ...
• Choose and design model: interest of people working on machine learning
• Preprocessing and feature design may lose useful information and not be optimized, since they are not parts of an end-to-end learning system
• Later stages: interest of people working on machine learning and computer vision, speech recognition, medical image processing, ...
• Preprocessing could be the result of another pattern recognition system

Person re-identification pipeline

Pedestrian detection → Pose estimation → Body parts segmentation → Photometric & geometric transform → Feature extraction → Classification

Face recognition pipeline

Face alignment → Geometric rectification → Photometric rectification → Feature extraction → Classification


Design Cycle with Deep Learning

• Learning plays a bigger role in the design cycle
• Feature learning becomes part of the end-to-end learning system
• Preprocessing becomes optional, which means that several pattern recognition steps can be merged into one end-to-end learning system
• Feature learning makes the key difference
• We underestimated the importance of data collection and evaluation


What makes deep learning successful in computer vision?

Li Fei-Fei:
• One million images with labels
• Predict 1,000 image categories

Geoffrey Hinton:
• CNN is not new
• Design network structure
• New training strategies

Features learned from ImageNet can be well generalized to other tasks and datasets!


Learning features and classifiers separately 


¢ Not all the datasets and prediction tasks are suitable 
for learning features with deep models 


[Figure: deep learning is trained on task 1 and task 2 (classifier 1, classifier 2, with a prediction on each task); the learned features are then used with classifier B for prediction on task B, our target task]

Deep learning can be treated as a language to describe the world with great flexibility

[Figure: two pipelines that each start from "collect data" (preprocessing 1, preprocessing 2, feature design, classifier) and the connection between them]


Introduction to Deep Learning 


¢ Introduction to classical deep models 


Introduction to Classical Deep Models

• Convolutional Neural Networks (CNN)
  - Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based Learning Applied to Document Recognition,” Proceedings of the IEEE, Vol. 86, pp. 2278-2324, 1998.

• Deep Belief Net (DBN)
  - G. E. Hinton, S. Osindero, and Y. Teh, “A Fast Learning Algorithm for Deep Belief Nets,” Neural Computation, Vol. 18, pp. 1527-1544, 2006.

• Auto-encoder
  - G. E. Hinton and R. R. Salakhutdinov, “Reducing the Dimensionality of Data with Neural Networks,” Science, Vol. 313, pp. 504-507, July 2006.


Classical Deep Models 


• Convolutional Neural Networks (CNN)
  - First proposed by Fukushima in 1980
  - Improved by LeCun, Bottou, Bengio and Haffner in 1998

[Figure: CNN with convolution, pooling, and learned filters]

Backpropagation

$W \leftarrow W - \eta \nabla J(W)$

W is the parameter of the network; J is the objective function; η is the learning rate

[Figure: feedforward operation from the input layer through the hidden layers to the output layer; back error propagation starting from the target values]

D. E. Rumelhart, G. E. Hinton, R. J. Williams, “Learning Representations by Back-propagating Errors,” Nature, Vol. 323, pp. 533-536, 1986.
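The update rule above can be sketched in a few lines. This is a minimal illustration, not the authors' training code: the toy objective, the function names and the learning rate are all assumptions made for the example, and the gradient is written by hand where back-propagation would compute it through the layers of a network.

```python
def gradient_descent(grad, w0, lr=0.1, steps=100):
    """Repeat the update W <- W - lr * grad(W) and return the final W."""
    w = list(w0)
    for _ in range(steps):
        g = grad(w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w


# Hypothetical toy objective J(W) = (w0 - 3)^2 + (w1 + 1)^2, minimised at (3, -1).
def toy_objective(w):
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2


def toy_gradient(w):
    # Hand-derived gradient of the toy objective.
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]


w_final = gradient_descent(toy_gradient, [0.0, 0.0])
```

With this learning rate each step shrinks the distance to the minimum by a constant factor, so `w_final` ends up very close to (3, -1).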


Wiring together, firing together

• CNN is a sparsified network
• Correlated neurons are connected
  - CNN assumes neurons in a neighborhood are correlated
  - Other priors on correlation?
  - Can CNN be further sparsified?
• Neurons in the brain are also sparsely connected, and the number of connections is reduced as people grow





Linear transform: chooses the direction in which to reduce space volume

Nonlinearity: controls how much volume is reduced in the selected direction and achieves invariance

f(x) = tanh(x)    f(x) = max(0, x)





Classical Deep Models 


• Deep belief net
  - Hinton’06

Pre-training:
• Good initialization point
• Make use of unlabeled data

$P(x, h_1, h_2) = p(x \mid h_1)\, p(h_1, h_2)$

$P(x, h_1) = \dfrac{e^{E(x, h_1)}}{\sum_{x, h_1} e^{E(x, h_1)}}$

$E(x, h_1) = b^\top x + c^\top h_1 + h_1^\top W x$
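For a very small model the distribution defined by E(x, h1) can be computed by brute force, which makes the normalisation explicit. The sketch below assumes binary units and uses made-up parameter values; it follows the sign convention of the formulas on this slide (probability proportional to exp(E)) and is only feasible for a handful of units.

```python
import itertools
import math


def energy(x, h, W, b, c):
    """E(x, h) = b'x + c'h + h'Wx, with W stored as one row per hidden unit."""
    bx = sum(bi * xi for bi, xi in zip(b, x))
    ch = sum(cj * hj for cj, hj in zip(c, h))
    hWx = sum(h[j] * W[j][i] * x[i]
              for j in range(len(h)) for i in range(len(x)))
    return bx + ch + hWx


def joint_distribution(W, b, c):
    """Return {(x, h): P(x, h)} by enumerating every binary configuration."""
    states = [(x, h)
              for x in itertools.product((0, 1), repeat=len(b))
              for h in itertools.product((0, 1), repeat=len(c))]
    weights = {s: math.exp(energy(s[0], s[1], W, b, c)) for s in states}
    z = sum(weights.values())  # normalising constant (the denominator)
    return {s: w / z for s, w in weights.items()}


# Hypothetical parameters: 3 visible units, 2 hidden units.
W = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
b = [0.1, -0.1, 0.0]
c = [0.2, -0.3]
P = joint_distribution(W, b, c)
```

The enumeration grows exponentially with the number of units, which is why practical training relies on approximations and not on the exact sum.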











Classical Deep Models 


• Auto-encoder
  - Hinton and Salakhutdinov 2006

Encoding:
$h_1 = \sigma(W_1 x + b_1)$
$h_2 = \sigma(W_2 h_1 + b_2)$

Decoding:
$\tilde{h}_1 = \sigma(W_2' h_2 + b_3)$
$\tilde{x} = \sigma(W_1' \tilde{h}_1 + b_4)$
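The encoding and decoding steps can be traced with a small forward pass. This is a sketch under stated assumptions: σ is taken to be the logistic sigmoid, the decoder weights W′ are taken to be the transposes of the encoder weights, and the layer sizes and parameter values are invented for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine_sigmoid(W, v, b):
    """Return sigma(W v + b) for a matrix W given as a list of rows."""
    return [sigmoid(sum(w * vi for w, vi in zip(row, v)) + bi)
            for row, bi in zip(W, b)]


def transpose(W):
    return [list(col) for col in zip(*W)]


def autoencode(x, W1, b1, W2, b2, b3, b4):
    h1 = affine_sigmoid(W1, x, b1)                     # encoding, first layer
    h2 = affine_sigmoid(W2, h1, b2)                    # encoding, second layer (the code)
    h1_rec = affine_sigmoid(transpose(W2), h2, b3)     # decoding with W2 transposed
    x_rec = affine_sigmoid(transpose(W1), h1_rec, b4)  # decoding with W1 transposed
    return h2, x_rec


# Hypothetical sizes: 4 inputs -> 3 hidden units -> 2 code units.
W1 = [[0.1, -0.2, 0.3, 0.0], [0.0, 0.4, -0.1, 0.2], [-0.3, 0.1, 0.2, 0.1]]
W2 = [[0.5, -0.4, 0.2], [0.1, 0.3, -0.2]]
code, x_rec = autoencode([1.0, 0.0, 1.0, 0.0],
                         W1, [0.0] * 3, W2, [0.0] * 2, [0.0] * 3, [0.0] * 4)
```

Training would adjust the weights so that `x_rec` approaches the input; the sketch only shows how the four equations chain together.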
Introduction to Deep Learning

• Why does deep learning work?

Feature Learning vs Feature Engineering

Feature Engineering

• The performance of a pattern recognition system heavily depends on feature representations
• Manually designed features dominated the applications of image and video understanding in the past
• Rely on human domain knowledge much more than data
• If handcrafted features have multiple parameters, it is hard to manually tune them
• Feature design is separate from training the classifier
• Developing effective features for new applications is slow


Handcrafted Features for Face Recognition 


[Figure: timeline of handcrafted face features: geometric features, pixel vector, Gabor filters, local binary patterns; time axis labels 1980s, 1992, 1997; annotations "2 parameters" and "3 parameters"]











Feature Learning 


¢ Learning transformations of the data that make it easier to 
extract useful information when building classifiers or 
predictors 


Learn the values of a huge number of parameters in feature 
representations 


Make better use of big data 


Jointly learning feature transformations and classifiers makes their 
integration optimal 


Faster to get feature representations for new applications 


Deep Learning Means Feature Learning 


• Deep learning is about learning hierarchical feature representations

$y = F(W^k \cdot F(W^{k-1} \cdot F(\cdots F(W^1 \cdot x))))$

• Good feature representations should be able to disentangle multiple factors coupled in the data

[Figure: pixels 1, 2, ..., n pass through an ideal feature transform that separates the factors below]
  - Identity: face recognition
  - Pose: pose estimation
  - Expression: expression recognition
  - Age: age estimation


Deep Learning Means Feature Learning 


¢ How to effectively learn features with deep models 
— With challenging tasks 
— Predict high-dimensional vectors 


[Figure: feature representation pipeline: pre-train on classifying 1,000 categories, fine-tune on classifying 201 categories, then an SVM binary classifier for each category]

Detect 200 object classes on ImageNet





W. Ouyang and X. Wang et al. “DeepID-Net: deformable deep convolutional neural 
networks for object detection”, CVPR, 2015 


[Figure: training stage A on dataset A: distinguish 1000 categories. Training stage B on dataset B: classifier B distinguishes 201 categories. Training stage C on dataset C: the feature transform is fixed and each classifier distinguishes one object class from all the negatives]





Example 1: deep learning generic image features 


• Hinton group’s groundbreaking work on ImageNet
  - They did not have much experience with general image classification on ImageNet
  - It took one week to train the network with 60 million parameters
  - The learned feature representations are effective on other datasets (e.g. Pascal VOC) and other tasks (object detection, segmentation, tracking, and image retrieval)

[Figure: pooling; 96 learned low-level filters; image classification result]
Example 2: deep learning face identity features by recovering canonical-view face images

[Figure: reconstruction examples from LFW]

Z. Zhu, P. Luo, X. Wang, and X. Tang, “Deep Learning Identity Preserving Face Space,” ICCV 2013.

• Deep model can disentangle hidden factors through feature extraction over multiple layers
• No 3D model; no prior information on pose and lighting condition
• Model multiple complex transforms
• Reconstructing the whole face is a much stronger supervision than predicting a 0/1 class label and helps to avoid overfitting

[Figure: feature extraction layers followed by a reconstruction layer]
Comparison on Multi-PIE

Method         | Per-condition results                 | Average | Mark
LGBP [26]      | 37.7, 62.5, [illegible]               | 59.3    | ✓
VAAM [17]      | 74.1, 91, 95.7, 95.7, 89.5, 74.8      | 86.9    | ✓
FA-EGFC [3]    | 84.7, 95, 99.3, 99, 92.9, [illegible] | 92.7    | ✗
SA-EGFC [3]    | 93, 98.7, 99.7, 99.7, 98.3, 93.6      | 97.2    | ✓
LE [4] + LDA   | 86.9, 95.5, 99.9, 99.7, 95.5, 81.8    | 93.2    | ✗
CRBM [9] + LDA | 80.3, 90.5, 94.9, 96.4, 88.3, 89.8    | 87.6    | ✗
Ours           | 95.6, 98.5, 100.0, 99.3, 98.5, 97.8   | 98.3    | ✗

[3] A. Asthana, T. K. Marks, M. J. Jones, K. H. Tieu, and M. Rohith. Fully automatic pose-invariant face recognition via 3d pose normalization. In ICCV, pages 937-944, 2011.
[4] Z. Cao, Q. Yin, X. Tang, and J. Sun. Face recognition with learning-based descriptor. In CVPR, pages 2707-2714, 2010.
[9] G. B. Huang, H. Lee, and E. Learned-Miller. Learning hierarchical representations for face verification with convolutional deep belief networks. In CVPR, pages 2518-2525, 2012.
[17] S. Li, X. Liu, X. Chai, H. Zhang, S. Lao, and S. Shan. Morphable displacement field based image matching for face recognition across pose. In ECCV, pages 102-115, 2012.
[26] W. Zhang, S. Shan, W. Gao, X. Chen, and H. Zhang. Local gabor binary pattern histogram sequence (lgbphs): A novel non-statistical model for face representation and recognition. In ICCV, volume 1, pages 786-791, 2005.


Deep learning 3D model from 2D images, 
mimicking human brain activities 





Z. Zhu, P. Luo, X. Wang, and X. Tang, “Deep Learning and Disentangling Face Representation by Multi-View Perception,” NIPS 2014.

[Figure: training stage A (face reconstruction): deep learning on face images in arbitrary views, reconstructing view 1 and view 2. Training stage B (face verification): two face images in arbitrary views pass through the fixed feature transform, then linear discriminant analysis decides whether the two images belong to the same person or not]


Example 3: deep learning face identity features 
from predicting 10,000 classes 


• At training stage, each input image is classified into 10,000 identities with 160 hidden identity features in the top layer
• The hidden identity features can be well generalized to other tasks (e.g. verification) and identities outside the training set
• As the number of classes to be predicted increases, the generalization power of the learned features also improves

[Figure: DeepID network built from convolutional layers]

Y. Sun, X. Wang, and X. Tang. Deep Learning Face Representation by Joint Identification-Verification. NIPS, 2014.

[Figure: training stage A (face identification) on dataset A: distinguish 10,000 people. Training stage B (face verification) on dataset B: the fixed feature transform followed by linear classifier B decides whether the two images belong to the same person or not]





Deep Structures vs Shallow Structures 
(Why deep?) 


Shallow Structures 


• A three-layer neural network (with one hidden layer) can approximate any classification function
• Most machine learning tools (such as SVM, boosting, and KNN) can be approximated as neural networks with one or two hidden layers
• Shallow models divide the feature space into regions and match templates in local regions. O(N) parameters are needed to represent N regions

[Figure labels: Oriental face, Occidental face]
Deep Machines are More Efficient for Representing Certain Classes of Functions

• Theoretical results show that an architecture with insufficient depth can require many more computational elements, potentially exponentially more (with respect to input size), than architectures whose depth is matched to the task (Håstad 1986, Håstad and Goldmann 1991)
• It also means many more parameters to learn

Take the d-bit parity function as an example:

$(x_1, \ldots, x_d) \mapsto \begin{cases} 1, & \text{if } \sum_{i=1}^{d} x_i \text{ is even} \\ -1, & \text{otherwise} \end{cases}$

d-bit logical parity circuits of depth 2 have exponential size (Andrew Yao, 1985)
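The size gap can be made concrete for one simple kind of depth-2 circuit, an OR of AND terms. The sketch below is only an illustration of that special case, not a proof of the cited result: a chain of two-input XOR gates computes parity with d - 1 gates, while the OR-of-ANDs form needs one AND term for every input pattern with an odd number of ones, that is 2^(d-1) terms. Function names are hypothetical.

```python
import itertools
from functools import reduce


def parity_deep(bits):
    """Chain of two-input XOR gates: d - 1 gates, depth grows with d."""
    odd = reduce(lambda a, b: a ^ b, bits, 0)
    return -1 if odd else 1


def odd_minterms(d):
    """All d-bit inputs with an odd number of ones (one AND term each)."""
    return [x for x in itertools.product((0, 1), repeat=d) if sum(x) % 2 == 1]


def parity_shallow(bits, minterms):
    """Depth-2 evaluation: OR over AND terms that each match one full pattern."""
    odd = any(tuple(bits) == m for m in minterms)
    return -1 if odd else 1


d = 4
terms = odd_minterms(d)  # 2 ** (d - 1) AND terms versus d - 1 XOR gates
```

Both functions agree on every input, but the list of AND terms doubles each time d grows by one, whereas the XOR chain grows by a single gate.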


[Figure: shallow structure versus deep structure; the deep structure reuses partial computation]

There are functions computable with a polynomial-size logic gate circuit of depth k that require exponential size when restricted to depth k - 1 (Håstad, 1986)

• Architectures with multiple levels naturally provide sharing and re-use of components
Honglak Lee, NIPS’10

Humans Understand the World through Multiple Levels of Abstractions

• We do not interpret a scene image with pixels
  - Objects (sky, cars, roads, buildings, pedestrians) -> parts (wheels, doors, heads) -> texture -> edges -> pixels
  - Attributes: blue sky, red car
• It is natural for humans to decompose a complex problem into sub-problems through multiple levels of representations

Humans Understand the World through Multiple Levels of Abstractions

• Humans learn abstract concepts on top of less abstract ones
• Humans can imagine new pictures by re-configuring these abstractions at multiple levels. Thus our brain generalizes well and can recognize things never seen before.
  - Our brain can estimate shape, lighting and pose from a face image and generate new images under various lightings and poses. That’s why we have good face recognition capability.


Local and Global Representations 


[Figure: local representation versus global representation, the latter illustrated by a binary attribute such as "Blue eyes? (1/0)"]

• The way these regions carve the input space still depends on few parameters: this huge number of regions are not placed independently of each other
• We can thus represent a function that looks complicated but actually has (global) structures

Human Brains Process Visual Signals through Multiple Layers

• A visual cortical area consists of six layers (Kruger et al. 2013)
Joint Learning vs Separate Learning

[Figure: separate learning, where each stage is obtained by training or manual design: data collection, preprocessing step 1, preprocessing step 2, manual feature design or feature transform, classification]

Deep learning is a framework/language but not a black-box model

[Figure: end-to-end learning: data collection, feature transform, feature transform, ..., classification]

End-to-end learning

Its power comes from joint optimization and increasing the capacity of the learner

• Domain knowledge could be helpful for designing new deep models and training strategies
• How to formulate a vision problem with deep learning?
  - Make use of experience and insights obtained in CV research
  - Sequential design/learning vs joint learning
  - Effectively train a deep model (layerwise pre-training + fine tuning)
What if we treat an existing deep model as a black box in pedestrian detection?

[Figure: ConvNet-U-MS: input, 1st stage (convolutions, subsampling), 2nd stage (convolutions, subsampling), full connection, classifier]

- P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun, “Pedestrian Detection with Unsupervised Multi-Stage Feature Learning,” CVPR 2013.
N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. CVPR, 2005. (6000 citations)

P. Felzenszwalb, D. McAllester, and D. Ramanan. A Discriminatively Trained, Multiscale, Deformable Part Model. CVPR, 2008. (2000 citations)

W. Ouyang and X. Wang. A Discriminative Deep Model for Pedestrian Detection with Occlusion Handling. CVPR, 2012.


Our Joint Deep Learning Model

[Figure: convolutional layer 1, average pooling, convolutional layer 2, deformation layer, visibility reasoning and classification]

W. Ouyang and X. Wang, “Joint Deep Learning for Pedestrian Detection,” Proc. ICCV, 2013.

Modeling Part Detectors

• Design the filters in the second convolutional layer with variable sizes

[Figure: part models learned (full-body, torso, head-shoulder, at levels 2 and 3) and learned filters at the second convolutional layer]

Deformation Layer

Visibility Reasoning with Deep Belief Net

[Figure: visibility correlates with part detection score]
Experimental Results

• Caltech: Test dataset (largest, most widely used)

[Figure: detection results on the Caltech test set, plotted against false positives per image. Methods highlighted on the successive slides:]

P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR, 2001.
N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. CVPR, 2005.
P. F. Felzenszwalb, R. B. Girshick, et al. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Analysis and Machine Intelligence, 2010.
W. Ouyang and X. Wang, “A Discriminative Deep Model for Pedestrian Detection with Occlusion Handling,” CVPR 2012.
W. Ouyang, X. Zeng and X. Wang, “Modeling Mutual Visibility Relationship in Pedestrian Detection,” CVPR 2013.
W. Ouyang and X. Wang, “Single-Pedestrian Detection aided by Multi-pedestrian Detection,” CVPR 2013.
X. Zeng, W. Ouyang and X. Wang, “A Cascaded Deep Learning Architecture for Pedestrian Detection,” ICCV 2013.
W. Ouyang and X. Wang, “Joint Deep Learning for Pedestrian Detection,” IEEE ICCV 2013.


Deformation layer for general object detection

[Figure: deformation layer with deformation penalty]

Deformation layer for repeated patterns

Pedestrian detection           | General object detection
Assume no repeated pattern     | Repeated patterns
Only consider one object class | Patterns shared across different object classes

Deformation constrained pooling layer

Can capture multiple patterns simultaneously

Deep model with deformation layer

Net structure   | AlexNet | Clarifai | Clarifai+Def layer
Mean AP on val2 | 0.299   | 0.360    | 0.385


Large learning capacity makes high dimensional data transforms possible, and makes better use of contextual information

• How to make use of the large learning capacity of deep models?
  - High dimensional data transform
  - Hierarchical nonlinear representations

[Figure: input mapped by a high-dimensional data transform, compared with SVM + feature, smoothness, shape prior...]


Face Parsing

• P. Luo, X. Wang and X. Tang, “Hierarchical Face Parsing via Deep Learning,” CVPR 2012

Motivations

• Recast face segmentation as a cross-modality data transformation problem
• Cross modality autoencoder
• Data of two different modalities share the same representations in the deep model
• Deep models can be used to learn shape priors for segmentation

Training Segmentators

[Figure panels: training segmentator: one-layer; (b) denoising autoencoder; (c) training segmentator: deep autoencoder; (d) testing segmentator]





[Figure: big data provides rich information; how to make use of it? Elements shown: challenging supervision task with rich predictions, hierarchical feature learning, capture contextual information, reduce capacity, go wider, go deeper, domain knowledge, make learning more efficient]

Deep learning = ?

• Machine learning with big data
• Feature learning
• Joint learning
• Contextual learning


Summary

• Automatically learns hierarchical feature representations from data and disentangles hidden factors of input data through multi-level nonlinear mappings
• For some tasks, the expressive power of deep models increases exponentially as their architectures go deep
• Jointly optimizes all the components in a vision system and creates synergy through close interactions among them
• Benefiting from the large learning capacity of deep models, we also recast some classical computer vision challenges as high-dimensional data transform problems and solve them from new perspectives
• It is more effective to train deep models with challenging tasks and rich predictions
Outline

• Deep learning for object recognition

Deep Learning Object Recognition

• Deep learning for object recognition on ImageNet
• Caption generation from images and videos
• Deep learning for face recognition
  - Learn identity features from joint verification-identification signals
  - Learn 3D face models from 2D images


CNN for Object Recognition on ImageNet

• Krizhevsky, Sutskever, and Hinton, NIPS 2012
• Trained on one million images of 1000 categories collected from the web with two GPUs; 2GB RAM on each GPU; 5GB of system memory
• Training lasts for one week

Rank | Name        | Error rate | Description
1    | U. Toronto  | 0.15315    | Deep learning
2    | U. Tokyo    | 0.26172    | Hand-crafted features and learning models. Bottleneck.
3    | U. Oxford   | 0.26979    | Hand-crafted features and learning models. Bottleneck.
4    | Xerox/INRIA | 0.27058    | Hand-crafted features and learning models. Bottleneck.


Model Architecture 


¢ Max-pooling layers follow the 1st, 2nd, and 5th convolutional layers


¢ The number of neurons in each layer is given by 253440, 
186624, 64896, 43264, 4096, 4096, 1000 


¢ 650000 neurons, 60 million parameters, 630 million 
connections 





Max 
pooling 





Normalization 


¢ Normalize the input by subtracting the mean image on the 
training set 





Input image (256 x 256) Mean image 


Krizhevsky 2012 


Activation Function 


¢ Rectified linear unit leads to sparse responses of neurons, 
such that weights can be effectively updated with BP 


f(x) = tanh(x)    f(x) = max(0, x)


Sigmoid (slow to train)    Rectified linear unit (quick to train)
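As a rough illustration of why the rectified linear unit trains faster, the sketch below compares the derivative of tanh with that of max(0, x). The function names are illustrative assumptions, not code from the cited paper. For large |x| the tanh derivative shrinks towards zero, while the ReLU derivative stays at 1 for every positive input and is exactly 0 otherwise, which also gives the sparse responses mentioned above.

```python
import math

def tanh_grad(x):
    # Derivative of tanh(x): 1 - tanh(x)^2, which saturates for large |x|.
    return 1.0 - math.tanh(x) ** 2

def relu(x):
    # Rectified linear unit: f(x) = max(0, x).
    return max(0.0, x)

def relu_grad(x):
    # Derivative of the ReLU: 1 for positive inputs, 0 otherwise.
    return 1.0 if x > 0 else 0.0

for x in (-4.0, -1.0, 0.5, 4.0):
    print(x, relu(x), relu_grad(x), tanh_grad(x))
```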


Krizhevsky 2012 


Data Augmentation 


¢ The neural net has 60M parameters and it overfits 


¢ Image regions are randomly cropped with shift; their 
horizontal reflections are also included 
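A minimal sketch of this augmentation, assuming an image stored as a list of pixel rows. The helper name `random_crop_and_flip` and the 50% flip probability are assumptions made for illustration; the slide does not give the exact sampling scheme.

```python
import random

def random_crop_and_flip(image, crop_h, crop_w, rng=random):
    """Return a randomly shifted crop of `image`, mirrored half of the time."""
    height, width = len(image), len(image[0])
    # Random shift: choose the top-left corner of the crop.
    top = rng.randint(0, height - crop_h)
    left = rng.randint(0, width - crop_w)
    crop = [row[left:left + crop_w] for row in image[top:top + crop_h]]
    # Horizontal reflection with probability 0.5 (assumed value).
    if rng.random() < 0.5:
        crop = [row[::-1] for row in crop]
    return crop
```

Each call yields a slightly different training example from the same source image, which enlarges the effective training set without new labels.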





Krizhevsky 2012 


Dropout 


Randomly set some input features and the outputs of hidden 
units as zero during the training process 


Feature co-adaptation: a feature is only helpful when other 
specific features are present 


— Because of the existence of noise and data corruption, some features 
or the responses of hidden nodes can be misdetected 


Dropout prevents feature co-adaptation and can significantly 
improve the generalization of the trained network 


Can be considered as another approach to regularization 
It can be viewed as averaging over many neural networks 
Slower convergence 
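The idea can be sketched in a few lines. The version below is the "inverted" dropout variant, which rescales the surviving activations during training so that nothing needs to change at test time; this variant and the function signature are assumptions for illustration, not necessarily the exact scheme used in the cited work.

```python
import random

def dropout(activations, drop_prob, training, rng):
    """Zero each activation with probability `drop_prob` during training."""
    if not training or drop_prob == 0.0:
        # At test time all units are kept unchanged.
        return list(activations)
    keep = 1.0 - drop_prob
    # Kept units are divided by `keep` so the expected value is preserved.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
print(dropout([1.0, 2.0, 3.0, 4.0], 0.5, True, rng))
```

Because a different random subset of units is removed at every step, no unit can rely on a specific partner being present, which is the co-adaptation argument made above.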


Classification Result 


[Figure: example classification results, with predicted labels such as bumper car, golfcart, snow leopard, fire engine, and dead-man's-fingers]


Krizhevsky 2012 


Detection Result 


[Figure: example detection results, with labels such as balance beam, grey fox, cinema, marimba, parallel bars, English foxhound, muzzle, disk brake, truck, armadillo, and Italian greyhound]


Krizhevsky 2012


Image Retrieval 





Krizhevsky 2012 


Adaptation to Smaller Datasets 


Directly use the feature representations learned from ImageNet and 
replace handcrafted features with them in image classification, scene 
recognition, fine grained object recognition, attribute recognition, image 
retrieval (Razavian et al. 2014, Gong et al. 2014) 


Use ImageNet to pre-train the model (good initialization), and use target 
dataset to fine-tune it (Girshick et al. CVPR 2014) 


Fix the bottom layers and only fine tune the top layers 





Max 
pooling 





GoogLeNet 


¢ More than 20 layers 

¢ Add supervision at multiple layers

¢ The error rate is reduced from 15.3% to 6.6%





Is computer vision 
a Classification problem? 


An image from ImageNet contains multiple objects 
and class label is not unique 


ImageNet is labeled by humans through crowdsourcing


Recent deep learning results have surpassed human
performance on the ImageNet image classification
task


How to further improve feature learning? 


Humans naturally use sentences instead of class
labels to describe images and videos


Computer vision 


Deep learning


Natural language processing 


Image and video caption generation 


¢ A more natural way to formulate vision problems is
to use sentences to describe images and videos 
instead of class labels 


¢ Model sequential data
P(y_1, ..., y_T)









[Figure: vision (deep CNN) followed by language generation (RNN), producing captions such as "A group of people shopping at an outdoor market." and "There are many vegetables at the fruit stand."]


Andrej Karpathy and Li Fei-Fei, “Deep Visual-Semantic Alignments for Generating 
Image Descriptions” CVPR 2015 


¢ S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, K. Saenko,
“Translating Videos to Natural Language Using Deep Recurrent Neural
Networks,” arXiv:1412.4729, 2014.

¢ J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K.
Saenko, and T. Darrell, “Long-term Recurrent Convolutional Networks for Visual
Recognition and Description,” arXiv:1411.4389, 2014.

¢ O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and Tell: A Neural Image
Caption Generator,” arXiv:1411.4555, 2014.


Recurrent neural network (RNN)


¢ Model a dynamic system driven by an external signal x_t


h_t = F_\theta(h_{t-1}, x_t)
¢ h_t contains information about the whole past sequence.
The equation above implicitly defines a function which
maps the whole past sequence (x_t, ..., x_1) to the current
state h_t = G_t(x_t, ..., x_1)





Recurrent neural network (RNN)


¢ The summary is lossy, since it maps an arbitrary
length sequence (x_t, ..., x_1) to a fixed-length vector h_t.
Depending on the training criterion, h_t keeps some
important aspects of the past sequence

¢ Sharing parameters: the same weights are used for 
different instances of the artificial neurons at 
different time steps 
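The recurrence and the parameter sharing can be sketched as follows. The tanh update rule and the weight names `w_hh`, `w_xh`, and `b` are assumptions chosen for illustration; the slides only specify the general form h_t = F(h_{t-1}, x_t).

```python
import math

def rnn_step(h_prev, x_t, w_hh, w_xh, b):
    """One step h_t = tanh(W_hh h_{t-1} + W_xh x_t + b)."""
    h_t = []
    for i in range(len(b)):
        total = b[i]
        total += sum(w_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        total += sum(w_xh[i][k] * x_t[k] for k in range(len(x_t)))
        h_t.append(math.tanh(total))
    return h_t

def run_rnn(xs, h0, w_hh, w_xh, b):
    """Apply the same weights at every time step of the sequence `xs`."""
    h = h0
    for x_t in xs:
        h = rnn_step(h, x_t, w_hh, w_xh, b)
    return h

# Hypothetical weights: 2 hidden units, 1-dimensional inputs.
W_HH = [[0.5, -0.2], [0.1, 0.3]]
W_XH = [[1.0], [-1.0]]
B = [0.0, 0.0]
print(run_rnn([[0.1], [0.2], [0.3]], [0.0, 0.0], W_HH, W_XH, B))
```

Whatever the length of `xs`, the result is a vector of the same fixed size, which is why the summary of the past is necessarily lossy.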







Recurrent neural network (RNN)


¢ Shares a similar idea with CNN: replacing a fully
connected network with local connections and
parameter sharing

¢ This allows the network to be applied to input sequences
of different lengths and to predict sequences of different
lengths





Recurrent neural network (RNN)


¢ Sharing parameters for any sequence length allows
better generalization.


¢ If we have to define a different function G_t for each
possible sequence length, each with its own 
parameters, we would not get any generalization to 
sequences of a size not seen in the training set. One 
would need to see a lot more training examples, 
because a separate model would have to be trained 
for each sequence length. 


Predict a single output at the end of 
the sequence 


¢ Such a network can be used to summarize a 
sequence and produce a fixed-size representation 
used as input for further processing. There might be 
a target right at the end 





Generative RNN modeling 


¢ Model P(x_1, ..., x_T). It can generate sequences from this distribution

¢ At the training stage, each x_t of the observed sequence serves
both as input (for the current time step) and as target (for the
previous time step)
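The input/target arrangement described above amounts to shifting the sequence by one step. A minimal sketch, with the helper name `next_step_pairs` chosen here for illustration:

```python
def next_step_pairs(sequence):
    """Pair each element (input) with its successor (training target)."""
    return list(zip(sequence[:-1], sequence[1:]))

for current, target in next_step_pairs(["a", "cat", "is", "playing"]):
    print(current, "->", target)
```

Every element except the first and the last appears twice: once as the target of the preceding step and once as the input of its own step.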





Vanishing and exploding gradients 


¢ RNN can be treated as a deep net when modeling 
long term dependency 

¢ After BP through many layers, the gradients become
either very small or very large 
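A one-dimensional linear recurrence h_t = w * h_{t-1} is enough to see the effect: the gradient of h_T with respect to h_0 is w multiplied by itself T times. This scalar setting is a simplifying assumption for illustration; in a real RNN the same role is played by repeated products of Jacobian matrices.

```python
def gradient_through_time(w, steps):
    """Gradient of h_T w.r.t. h_0 for the linear recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # one chain-rule factor per time step
    return grad

print(gradient_through_time(0.5, 50))  # shrinks towards zero (vanishing)
print(gradient_through_time(1.5, 50))  # grows very large (exploding)
```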





Leaky units with self-connections 


The new value of the state h_{t+1} is a combination of linear and
non-linear parts of h_t


The errors are easier to be back propagated through the paths 
of red lines, which are linear 
Leaky units with self-connections


¢ τ controls the rate of forgetting old states. It can be
viewed as a smooth variant of the idea of the
previous model


¢ By associating different time scales τ with different
units, one obtains different paths corresponding to
different forgetting rates
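A minimal scalar sketch of a leaky unit, assuming the commonly used update h_{t+1} = (1 - 1/tau) * h_t + (1/tau) * tanh(w * x_t + u * h_t), where tau is the time scale. The exact formula is not given on the slide, so this form and the weights `w` and `u` are assumptions for illustration.

```python
import math

def leaky_step(h, x, tau, w=1.0, u=0.5):
    """Mix a linear copy of the old state with a non-linear update."""
    linear_part = (1.0 - 1.0 / tau) * h
    nonlinear_part = (1.0 / tau) * math.tanh(w * x + u * h)
    return linear_part + nonlinear_part

# A large time scale keeps most of the old state; a small one forgets quickly.
print(leaky_step(1.0, 0.0, tau=100.0))
print(leaky_step(1.0, 0.0, tau=2.0))
```

With tau = 1 the linear path disappears and the unit reduces to an ordinary non-linear recurrent unit.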
Long Short-Term Memory (LSTM) net


¢ In the leaky units with self-connections, the forgetting 
rate is constant during the whole sequence 


¢ The role of leaky units is to accumulate information
over a long duration. However, once that information
gets used, it might be useful for the neural network to
forget the old state

— For example, if a video sequence is composed of
subsequences corresponding to different actions, we want a
leaky unit to accumulate evidence inside each subsequence,
and we need a mechanism to forget the old state by setting
it to zero and starting to count afresh when starting to
process the next subsequence


Long Short-Term Memory (LSTM) net 


¢ The forgetting rates are expected to be different at
different time steps, depending on the previous
hidden states and current input (conditioning the
forgetting on the context)


¢ Parameters controlling the forgetting rates are
learned from training data


f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)
i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)
o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)
g_t = \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)
c_t = f_t \odot c_{t-1} + i_t \odot g_t
h_t = o_t \odot \tanh(c_t)
z_t = softmax(W_{hz} h_t + b_z)











[Figure: LSTM unit, showing the input gate, the input modulation gate, and the forget gate]


Long Short-Term Memory (LSTM) net 


¢ The core of LSTM is a memory cell c_t which encodes,
at every time step, the knowledge of the inputs that
have been observed up to that step


¢ c_t has a linear self-connection similar to the leaky
units, but the self-connection weight is controlled by
a forget gate unit f_t, that sets this weight to a value
between 0 and 1 via a sigmoid unit


f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)
¢ The input gate unit i_t is computed similarly to the
forget gate, but with its own parameters


Long Short-Term Memory (LSTM) net 


¢ The output h_t of the LSTM cell can also be shut off,
via the output gate o_t (h_t = o_t \odot \tanh(c_t))
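The gate computations can be put together in a scalar sketch, in which every weight is a single number and the element-wise products become ordinary multiplications. The parameter dictionary `p` and its key names are assumptions for illustration; a real implementation would use weight matrices and vectors.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step for scalar input, hidden state, and memory cell."""
    f = sigmoid(p["w_xf"] * x + p["w_hf"] * h_prev + p["b_f"])    # forget gate
    i = sigmoid(p["w_xi"] * x + p["w_hi"] * h_prev + p["b_i"])    # input gate
    o = sigmoid(p["w_xo"] * x + p["w_ho"] * h_prev + p["b_o"])    # output gate
    g = math.tanh(p["w_xc"] * x + p["w_hc"] * h_prev + p["b_c"])  # input modulation
    c = f * c_prev + i * g   # gated linear self-connection of the memory cell
    h = o * math.tanh(c)     # the output can be shut off by the output gate
    return h, c

def zero_params():
    keys = ["w_xf", "w_hf", "b_f", "w_xi", "w_hi", "b_i",
            "w_xo", "w_ho", "b_o", "w_xc", "w_hc", "b_c"]
    return {k: 0.0 for k in keys}
```

Setting the forget-gate bias strongly negative wipes the memory cell, while setting it strongly positive (with the input gate closed) carries the old cell value forward unchanged, which is the context-dependent forgetting described above.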


Motivated by language translation 


¢ Model P(y_1, ..., y_{T'} | x_1, ..., x_T). The input and output
sequences have different lengths, are not aligned, and
may not even have a monotonic relationship


¢ Use one LSTM to read the input sequence (x_1, ..., x_T),
one timestep at a time, to obtain a large fixed-dimensional
vector representation v, which is given by
the last hidden state of the LSTM


I. Sutskever, O. Vinyals, and Q. Le, “Sequence to Sequence Learning with Neural
Networks,” NIPS 2014. 


Motivated by language translation 


¢ Then conditioned on v, a second LSTM generates the
output sequence (y_1, ..., y_{T'}) and computes its
probability
The figure shows a 2-dimensional PCA projection of the LSTM hidden states that
are obtained after processing the phrases in the figures. The phrases are 
clustered by meaning, which in these examples is primarily a function of word 
order, which would be difficult to capture with a bag-of-words model. The figure 
clearly shows that the representations are sensitive to the order of words, while 
being fairly insensitive to the replacement of an active voice with a passive voice. 


Generate image caption 


¢ Use a CNN as an image encoder that transforms the image into a
fixed-length vector


¢ This vector is used as the initial hidden state of a “decoder”
RNN that generates the target sequence





A group of people 
shopping at an 
outdoor market. 


There are many 
vegetables at the 
fruit stand. 


O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and Tell: A Neural Image 
Caption Generator,” arXiv: 1411.4555, 2014. 


Translate videos to sentences 


¢ Previous works simplified the problem by detecting a fixed set 
of semantic roles, such as subject, verb, and object, as an 
intermediate representation and adopted oversimplified rigid 
sentence templates 
Input video: 





Machine output: A cat is playing with toy. 
Humans: A Ferret and cat fighting with each other. /A cat and a ferret are playing. /A 
kitten and a ferret are playfully wresting. 


S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, K. Saenko, “Translating Videos 
to Natural Language Using Deep Recurrent Neural Networks,” arXiv: 1412.4729, 2014. 


[Figure: pipeline from input video through a convolutional net and a recurrent net to the output words, e.g. "playing", "ball"]





Deep Learning Object Recognition 


¢ Deep learning for face recognition 


— Learn identity features from joint verification- 
identification signals 


Deep Learning Results on LFW


Method                 Accuracy  Training data
Huang et al. CVPR’12   87%       Unsupervised
Sun et al. ICCV’13     92.52%    87,628
Facebook (CVPR’14)     97.35%    7,000,000
DeepID (CVPR’14)       97.45%    202,599
DeepID2 (NIPS’14)      99.15%    202,599
DeepID2+ (CVPR’15)     99.47%    450,000
Google (CVPR’15)       99.63%    200,000,000

Partially recovered column (rows unclear): 6+6/7, 18, 18


The first deep learning work on face recognition was done by Huang et al. in 2012. With
unsupervised learning, the accuracy was 87%

Our work at ICCV’13 achieved a result (92.52%) comparable with the state of the art

Our work at CVPR’14 reached 97.45%, close to “human cropped” performance (97.53%)

DeepFace, developed by Facebook and also presented at CVPR’14, used 73-point 3D face
alignment and 7 million training images (35 times more than ours)

Our most recent work reached 99.15%, close to “human funneled” performance (99.20%)


Closed- and open-set face 
identification on LFW 


Method            Rank-1 (%)   DIR @ 1% FAR (%)
COST-S1 [1]       56.7         25
COST-S1+s2 [1]    66.5         35
DeepFace [2]      64.9
DeepFace+ [3]     82.5
DeepID2           91.1
DeepID2+          95.0





[1] L. Best-Rowden, H. Han, C. Otto, B. Klare, and A. K. Jain. Unconstrained face recognition: 
Identifying a person of interest from a media collection. TR MSU-CSE-14-1, 2014. 

[2] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level
performance in face verification. In Proc. CVPR, 2014.

[3] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Web-scale training for face identification.
Technical report, arXiv:1406.5266, 2014.


Eternal Topic on Face Recognition 





Inter-personal variation 


How to separate the two types of variations? 


Are they the same person or not? 





Nicole Kidman Nicole Kidman 


Are they the same person or not? 





Coco d’Este    Melina Kanakaredes


Are they the same person or not? 





Elijah Wood Stefano Gabbana 


Are they the same person or not? 





Jim O’Brien    Jim O’Brien


Are they the same person or not? 





Jacqueline Obradors    Julie Taymor


¢ Out of 6000 image pairs on the LFW test set, 51 pairs 
are misclassified with the deep model 


¢ We randomly mixed them and presented them to 10 
Chinese subjects for evaluation. Their averaged 
verification accuracy is 56%, close to random guess 
(50%) 


Linear Discriminant Analysis


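W_{opt} = \arg\max_W \frac{|W^T S_b W|}{|W^T S_w W|}


S_b = \sum_k (\bar{x}_k - \bar{x})(\bar{x}_k - \bar{x})^T \propto \sum_{k,l} (\bar{x}_k - \bar{x}_l)(\bar{x}_k - \bar{x}_l)^T


S_w = \sum_k \sum_{i \in C_k} (x_i - \bar{x}_k)(x_i - \bar{x}_k)^T \propto \sum_k \sum_{i,j \in C_k} (x_i - x_j)(x_i - x_j)^T


W^* = \arg\max_W |W^T S_b W|   s.t.   |W^T S_w W| = 1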


LDA seeks a linear feature mapping which maximizes the distance
between class centers under the constraint that the intra-personal
variation is constant


oe bab oe = rh ye, 
re ) | 
oe 


at, >. |Fixy) — fix; i|- = | 


(i pied? 


Deep Learning for Face Recognition 


¢ Extract identity preserving features through 
hierarchical nonlinear mappings 


¢ Model complex intra- and inter-personal 
variations with large learning capacity 


Learn Identity Features from Different 
Supervisory Tasks 
¢ Face identification: classify an image into one 
of N identity classes 
— multi-class classification problem 


¢ Face verification: verify whether a pair of
images belong to the same identity or not 
— binary classification problem 


Minimize the intra-personal variation under the constraint
that the distance between classes is constant (i.e. contracting
the volume of the image space without reducing the distance
between classes)





y = f(x);  g = softmax()


f^* = \arg\min_f \sum_{label(x_i) = label(x_j)} \|f(x_i) - f(x_j)\|^2


s.t.  \|g(f(x_i)) - g(f(x_j))\| = 1,  label(x_i) \neq label(x_j)


Learn Identity Features with 
Verification Signal 


¢ Extract relational features with learned filter pairs
y_j = f(b_j + k_{1j} * x_1 + k_{2j} * x_2)
¢ These relational features are further processed through 
multiple layers to extract global features 


¢ The fully connected layer can be used as features to combine 
with multiple ConvNets 


[Figure: ConvNet architecture with an input layer, convolutional layers interleaved with max-pooling layers 1 to 3, and a fully-connected layer]


Y. Sun, X. Wang, and X. Tang, “Hybrid Deep Learning for Computing Face Similarities,” Proc. ICCV, 2013. 


Results on LFW 


¢ Unrestricted protocol without outside training data 


Method                        Accuracy (%)
ConvNet-RBM previous [43]     91.75 ± 0.48
VMRS [3]                      92.05 ± 0.45
CMD+SLBP [23]                 92.58 ± 1.36
VisionLabs ver. 1.0 [1]       92.90 ± 0.31
Fisher vector faces [41]      93.03 ± 1.05
High-dim LBP [13]             93.18 ± 1.07
Aurora [19]                   93.24 ± 0.44
ConvNet-RBM                   93.83 ± 0.52

[Figure: ROC curves (true positive rate vs. false positive rate) for the methods listed above]





Results on LFW 


¢ Unrestricted protocol using outside training data 


Method                                    Accuracy (%)
Joint Bayesian [12]                       92.42 ± 1.08
ConvNet-RBM previous [43]                 92.52 ± 0.38
Tom-vs-Pete (with attributes) [4]         93.30 ± 1.28
High-dim LBP [13]                         95.17 ± 1.13
TL Joint Bayesian [10]                    96.33 ± 1.08
ConvNet-RBM                               97.08 ± 0.28

[Figure: ROC curves (true positive rate vs. false positive rate) for Joint Bayesian (WDRef) [12], ConvNet-RBM previous (CelebFaces) [43], Tom-vs-Pete (with attributes) [4], High-dim LBP (WDRef) [13], TL Joint Bayesian [10], and ConvNet-RBM (CelebFaces)]
DeepID: Learn Identity Features with
Identification Signal 
Y. Sun, X. Wang, and X. Tang, “Deep Learning Face Representation from Predicting 10,000 classes,” Proc. CVPR, 2014.


During training, each image is classified into 10,000 
identities with 160 identity features in the top layer 


These features keep rich inter-personal variations 


Features from the last two convolutional layers are 
effective 

The hidden identity features can be well generalized 
to other tasks (e.g. verification) and identities 
outside the training set 


[Figure: DeepID ConvNet with convolutional layers 1 to 4 and the deep hidden identity features (DeepID) layer]


¢ High-dimensional prediction is more challenging, but 
also adds stronger supervision to the network 


¢ As the number of classes to be predicted increases, the
generalization power of the learned features also
improves


[Figure: ConvNet with convolutional layers followed by a soft-max layer]
Extract Features from Multiple ConvNets


[Figure: multiple ConvNets applied to face patches; each is trained with n = 10000 identity classes and yields 160 deep hidden identity features (DeepID) at its feature extraction layer]


























Learn Identity Features with 
Identification Signal 


¢ After combining hidden identity features from
multiple ConvNets and further reducing
dimensionality with PCA, each face image has
150-dimensional features as its signature


¢ These features can be further processed by other 
classifiers in face verification. Interestingly, we find 
Joint Bayesian is more effective than cascading 
another neural network to classify these features 


DeepID2: Joint Identification- 
Verification Signals 


¢ Every two feature vectors extracted from the same
identity should be close to each other


Verif(f_i, f_j, y_{ij}, \theta_{ve}) = \frac{1}{2} \|f_i - f_j\|_2^2                  if y_{ij} = 1
Verif(f_i, f_j, y_{ij}, \theta_{ve}) = \frac{1}{2} \max(0, m - \|f_i - f_j\|_2)^2     if y_{ij} = -1


f_i and f_j are feature vectors extracted from two face images in comparison


y_{ij} = 1 means they are from the same identity; y_{ij} = -1 means different identities


m is a margin to be learned
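The verification signal can be sketched as a small function operating on feature vectors stored as plain lists. The function name `verification_loss` is an assumption for illustration, and the margin `m` is simply passed in here rather than learned.

```python
import math

def verification_loss(f_i, f_j, y_ij, m):
    """Contrastive verification loss on a pair of feature vectors."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(f_i, f_j)))
    if y_ij == 1:
        # Same identity: pull the two features together.
        return 0.5 * dist ** 2
    # Different identities: push apart until the distance exceeds the margin.
    return 0.5 * max(0.0, m - dist) ** 2

print(verification_loss([0.0, 0.0], [3.0, 4.0], 1, m=2.0))
print(verification_loss([0.0, 0.0], [3.0, 4.0], -1, m=7.0))
```

Pairs from different identities that are already further apart than the margin contribute no loss, so training effort concentrates on the confusable pairs.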


Y. Sun, X. Wang, and X. Tang. Deep Learning Face Representation by Joint Identification-Verification. 
NIPS, 2014. 


Balancing Identification and
Verification Signals with Parameter λ
λ = 0: only identification signal
λ = +∞: only verification signal


Rich Identity Information Improves 
Feature Learning 


¢ Face verification accuracies with the number of 
training identities 
Summary of DeepID2


¢ 25 face regions at different scales and locations
around landmarks are selected to build 25 neural 
networks 

¢ All the 160 X 25 hidden identity features are further 
compressed into a 180-dimensional feature vector 
with PCA as a signature for each image 


¢ With a single Titan GPU, the feature extraction 
process takes 35ms per image 


DeepID2+ 


¢ Larger network structures

¢ Larger training data

¢ Adding supervisory signals at every layer
Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust.
CVPR, 2015. 


Compare DeepID2 and DeepID2+ on LFW 


[Figure: bar chart of face verification accuracy per net ID for DeepID2 and DeepID2+]


Comparison of face verification accuracies on LFW with ConvNets trained on 25 face 
regions given in DeepID2 


Best single model is improved from 96.72% to 98.70% 


Final Result on LFW 


Method                 Accuracy (%)
High-dim LBP [1]       95.17
TL Joint Bayesian [2]  96.33
DeepFace [3]           97.35
DeepID [4]             97.45
DeepID2 [5]            99.15
DeepID2+ [6]           99.47





[1] Chen, Cao, Wen, and Sun. Blessing of dimensionality: High-dimensional feature and 
its efficient compression for face verification. CVPR, 2013. 


[2] Cao, Wipf, Wen, Duan, and Sun. A practical transfer learning algorithm for face 
verification. ICCV, 2013. 


[3] Taigman, Yang, Ranzato, and Wolf. DeepFace: Closing the gap to human-level 
performance in face verification. CVPR, 2014. 


[4] Sun, Wang, and Tang. Deep learning face representation from predicting 10,000 
classes. CVPR, 2014. 


[5] Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep Learning Face Representation by Joint 
Identification-Verification. NIPS, 2014. 


[6] Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, 
selective, and robust. CVPR, 2015. 


Closed- and open-set face 
identification on LFW 


Ciiater: Rank-1 (%) DIR @ 1% FAR (%) 
COST-S1 [1] 56.7 25 
COST-S1+s2 [1] 66.5 35 


DeepFace [2] 64.9 
DeepFace+ [3] 82.5 
DeepID2 91.1 
Deep|ID2+ 95.0 





[1] L. Best-Rowden, H. Han, C. Otto, B. Klare, and A. K. Jain. Unconstrained face recognition: 
Identifying a person of interest from a media collection. TR MSU-CSE-14-1, 2014. 

[2] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level 
performance in face verifica- tion. In Proc. CVPR, 2014. 

[3] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Web- scale training for face identification. 
Technical report, arXiv:1406.5266, 2014. 


Face Verification on YouTube Faces 
Methods                 Accuracy (%)
LM3L [1]                81.3 ± 1.2
DDML (LBP) [2]          81.3 ± 1.6
DDML (combined) [2]     82.3 ± 1.5
EigenPEP [3]            84.8 ± 1.4
DeepFace [4]            91.4 ± 1.1
DeepID2+                93.2 ± 0.2


[1] J. Hu, J. Lu, J. Yuan, and Y. P. Tan, “Large margin multi-metric learning for face and 
kinship verification in the wild,’ ACCV 2014 


[2] J. Hu, J. Lu, and Y. P. Tan, “Discriminative deep metric learning for face verification in 
the wild,” CVPR 2014 


[3] H. Li, G. Hua, X. Shen, Z. Lin, and J. Brandt, “Eigen-pep for video face recognition,” 
ACCV 2014 


[4] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “DeepFace: Closing the gap to human- 
level performance in face verification,’ CVPR 2014. 


¢ Linear transform 





¢ Pooling 





¢ Nonlinear mapping 


Unified subspace analysis 


¢ Identification signal is in S_b;
verification signal is in S_w


Maximize distance between 
classes under constraint 
that intrapersonal variation 
is constant 


Linear feature mapping 


Joint deep learning 


Learn features by joint 
identification-verification 


Minimize intra-personal 
variation under constraint 
that the distance between 
classes is constant 


Hierarchical nonlinear 
feature extraction 


Generalization power increases 
with more training identities 


What has been learned by DeepID2+? 


Properties owned by neurons? 


Moderately sparse


Selective to identities and attributes 
Robust to data corruption 


These properties are naturally owned by DeepID2+ through large-scale training, 
without explicitly adding regularization terms to the model 


Biological Motivation 







¢ Monkey has a face-processing network that is made of six 
interconnected face-selective regions 


¢ Neurons in some of these regions were view-specific, while 
some others were tuned to identity across views 


¢ View could be generalized to other factors, e.g. expressions? 


Winrich A. Freiwald and Doris Y. Tsao, “Functional compartmentalization and viewpoint generalization 
within the macaque face-processing system,” Science, 330(6005):845—851, 2010. 


Deeply learned features are moderately sparse


¢ For an input image, about half of the neurons are activated
¢ A neuron responds to about half of the images


George W Bush Background 
Deeply learned features are moderately sparse


¢ The binary codes on activation patterns of neurons are very 
effective on face recognition 


¢ Activation patterns are more important than activation 
magnitudes in face recognition 
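A minimal sketch of comparing faces by activation patterns alone: each feature vector is reduced to a binary code that records which neurons are active, and two codes are compared by Hamming distance. The threshold of zero and the helper names are assumptions for illustration.

```python
def binarize(features, threshold=0.0):
    """Keep only the activation pattern: 1 if the neuron fires, else 0."""
    return [1 if value > threshold else 0 for value in features]

def hamming_distance(code_a, code_b):
    """Number of positions where two binary codes disagree."""
    return sum(a != b for a, b in zip(code_a, code_b))

face_a = [0.0, 1.7, 0.0, 0.4, 2.1]
face_b = [0.0, 0.9, 0.0, 0.0, 3.0]
print(hamming_distance(binarize(face_a), binarize(face_b)))
```

Two faces whose activation magnitudes differ but whose sets of active neurons largely coincide end up with a small distance, which matches the observation that patterns matter more than magnitudes.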


                               Joint Bayesian (%)   Hamming distance (%)
Single model (real values)     98.70
Single model (binary code)     97.67                96.46
Combined model (real values)   99.47                n/a
Combined model (binary code)   99.12                97.47


Deeply learned features are selective to 
identities and attributes 


¢ With a single neuron, DeepID2 reaches 97% recognition
accuracy for some identities and attributes


George W Bush Background 
Deeply learned features are selective to
identities and attributes 


¢ Excitatory and inhibitory neurons 
Histograms of neural activations over identities with the most images in LFW

Histograms of neural activations over gender-related attributes (Male and Female)

Histograms of neural activations over race-related attributes (White, Black, Asian, and Indian)

Histogram of neural activations over age-related attributes (Baby, Child, Youth, Middle Aged, and Senior)
Deeply learned features are selective to
identities and attributes 


¢ Visualize the semantic meaning of each neuron


[Figure: face images ordered from high response to low response for individual neurons, e.g. a gender-selective neuron]





Deeply learned features are selective to 
identities and attributes 


¢ Visualize the semantic meaning of each neuron


Test Image Activations Neurons 





Neurons are ranked by their responses in descending order with respect to test images 


DeepID2 features for attribute recognition


¢ Features at top layers are more effective at recognizing
identity-related attributes

¢ Features at lower layers are more effective on attributes
unrelated to identity


Top hidden layer Lower convolution layers 
Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep Learning Face Attributes in the Wild,” ICCV 2015


DeepID2 features for attribute recognition


DeepID2 features can be directly used for attribute recognition 


Use DeepID2 features as initialization (pre-trained result), and
then fine-tune on attribute recognition


Average accuracy on 40 attributes on CelebA and LFWA datasets 





Method                                              CelebA   LFWA
FaceTracer [1] (HOG+SVM)                            81       74
PANDA-W [2] (parts are automatically detected)      79       71
PANDA-CL [2] (parts are given by ground truth)      85       81
Training CNN from scratch with attributes           83       79
Directly use DeepID2 features                       84       82
DeepID2 + fine-tune                                 87       84




Deeply learned features are robust to occlusions 


¢ Global features are more robust to occlusions 
Outline


¢ Deep learning for face recognition 


— Learn 3D face models from 2D images 


Deep Learning Multi-view 
Representation from 2D Images 


¢ Inspired by brain behaviors [Freiwald and Tsao, Science 2010]
¢ Identity and view represented by different sets of neurons 


¢ Given an image under an arbitrary view, its viewpoint can be
estimated and its full spectrum of views can be reconstructed





Z. Zhu, P. Luo, X. Wang, and X. Tang, “Deep Learning and Disentangling Face Representation by Multi-View Perception,” 
NIPS 2014. 


Deep Learning Multi-view 
Representation from 2D Images 


x and y are input and output images of
the same identity but in different views;


v is the view label of the output image;


h^{id} are neurons encoding identity
features


h^{v} are neurons encoding view features

h^{r} are neurons encoding features to
reconstruct the output images


Raw Pixels+LDA 
LBP [1]+LDA
Landmark LBP [6]+LDA 


CNN+LDA 

FIP [28]+LDA 
RL [28]+LDA 
MTL+RL+LDA 


MVP),ia+LDA 
MVP,,;a+LDA 
MVPyr+LDA 
MVP), +LDA 


Avg. 


61.5 
79.3 


72.6 
62.3 


0° 


92.5 
95.7 


91.0 
83.4 


—15° 


85.4 
93.3 


86.7 
fe: 


+15°


84.9 
92.2 


84.1 
73.1 


—30° 


64.3 
83.4 


74.6 
62.0 


+30° 


67.0 
83.9 


74.2 
63.9 


—45° 


51.6 
73.2 


68.5 
Di 


+45°


45.4 
70.6 


63.8 
53.2 


—60° 


oul 
60.2 


55.7 
44.4 


+60° 





28.3 
60.0 


56.0 
46.9 


Face recognition accuracies across views and illuminations on the Multi-PIE 
dataset. The first and the second best performances are in bold. 


[1] T. Ahonen, A. Hadid, and M. Pietikainen. Face description with local binary patterns: Application to face
recognition. TPAMI, 28:2037–2041, 2006.

[6] Dong Chen, Xudong Cao, Fang Wen, and Jian Sun. Blessing of dimensionality: High-dimensional feature
and its efficient compression for face verification. In CVPR, 2013.

[28] Z. Zhu, P. Luo, X. Wang, and X. Tang. Deep learning identity preserving face space. In ICCV, 2013.


Deep Learning Multi-view 
Representation from 2D Images 


¢ Interpolate and predict images under viewpoints unobserved 
in the training set 
The training set has viewpoints of 0°, 30°, and 60°. (a) The reconstructed
images under 15° and 45° when the input is taken under 0°. (b) The input images
are under 15° and 45°.


Outline 


¢ Deep learning for object segmentation 


Whole-image classification vs 
pixelwise classification 


Whole-image classification: predict a single label for 
the whole image 

Pixelwise classification: predict a label at every pixel 
— Segmentation, detection, and tracking 

CNN and its forward and backward propagation were
originally proposed for whole-image classification

This difference was ignored when CNN was applied
to pixelwise classification problems, which therefore
encountered efficiency problems


Pixelwise Classification 


¢ Image patches centered at each pixel are used as the
input of a CNN, and the CNN predicts a class label for
each pixel
¢ A lot of redundant computation because of overlap
between patches

[Figure: image patches around each pixel location are fed to a trained CNN, which outputs a class label for each pixel]


Farabet et al. TPAMI 2013; Pinheiro and Collobert ICML 2014


Classify Segmentation Proposal 


¢ Determines which segmentation proposal can best
represent objects of interest
R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich Feature Hierarchies for Accurate Object Detection and Semantic
Segmentation” CVPR 2014 


Directly Predict Segmentation Maps
P. Luo, X. Wang, and X. Tang, “Pedestrian Parsing via Deep Decompositional Network,” ICCV 2013.


Direct Predict Segmentation Maps 


¢ The classifier is location sensitive and has no
translation invariance
— Prediction depends not only on the neighborhood
of the pixel, but also on its location
¢ Only suitable for images with regular
structures, such as faces and humans


Efficient Forward-Propagation of Convolutional 
Neural Networks 


e Generate the same result as patch-by-patch scanning, with 1500 
times speedup for both forward and backward propagation 


[Figure: (a) Patch-by-patch scanning for CNN based pixelwise classification (CNN, predictions, target label map). (b) Our approach (forward propagation, prediction map, target label map, selecting errors on pixels via error mask, backward propagation).]


H. Li, R. Zhao, and X. Wang, “Highly Efficient Forward and Backward Propagation of Convolutional 
Neural Networks for Pixelwise Classification,” arXiv:1412.4526, 2014 
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The layewise timing and speedup results of the forward and backward propagation 
by our proposed algorithm on the RCNN model with 3×410×410 images as inputs.


Fully convolutional neural network 


¢ Replace fully connected layers in CNN with 1 x 1 
convolution kernel just like “network in network” 
(Lin, Chen and Yan, arXiv 2013) 


¢ Take the whole images as inputs and directly output 
segmentation map 


¢ Has translation invariance like patch-by-patch 
scanning, but with much lower computational cost 


¢ Once FCNN is learned, it can process input images of 
any sizes without warping them to a standard size 


K. Kang and X. Wang, “Fully Convolutional Neural Networks for Crowd Segmentation,” arXiv:1411.4464, 2014 
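
The claim that a fully convolutional network gives the same result as patch-by-patch scanning can be checked on a toy case. The sketch below is only an illustration under simplifying assumptions: a single "fully connected" unit with a k x k weight grid, integer inputs so the comparison is exact, and hypothetical function names (`patch_scan`, `conv_scan`) that do not come from the cited papers.

```python
def dense_on_patch(patch, weights, bias):
    """Fully connected unit: weighted sum over every pixel of one k x k patch."""
    total = bias
    for patch_row, weight_row in zip(patch, weights):
        for p, w in zip(patch_row, weight_row):
            total += p * w
    return total


def patch_scan(image, weights, bias):
    """Crop a k x k patch around each valid location and run the dense unit on it."""
    k = len(weights)
    rows, cols = len(image) - k + 1, len(image[0]) - k + 1
    out = []
    for i in range(rows):
        out_row = []
        for j in range(cols):
            patch = [r[j:j + k] for r in image[i:i + k]]
            out_row.append(dense_on_patch(patch, weights, bias))
        out.append(out_row)
    return out


def conv_scan(image, weights, bias):
    """Treat the same weights as a convolution kernel applied to the whole image."""
    k = len(weights)
    rows, cols = len(image) - k + 1, len(image[0]) - k + 1
    out = [[bias] * cols for _ in range(rows)]
    for di in range(k):
        for dj in range(k):
            w = weights[di][dj]
            for i in range(rows):
                for j in range(cols):
                    out[i][j] += w * image[i + di][j + dj]
    return out


# Toy 5 x 6 integer image and a 3 x 3 weight grid (arbitrary values).
image = [[(3 * i + 5 * j) % 7 for j in range(6)] for i in range(5)]
weights = [[1, -2, 0], [0, 3, -1], [2, 0, 1]]
bias = 4

scanned = patch_scan(image, weights, bias)
convolved = conv_scan(image, weights, bias)
```

Both functions return the same 3 x 4 score map, and `conv_scan` accepts an image of any size without warping, which is the property the slide attributes to the FCNN. The speedup in a real network comes from sharing the convolutional feature maps between overlapping patches, which this single-unit sketch does not measure.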


Fully convolutional neural network


[Figure: (a) CNN patch-scanning; (b) CNN regression; (c) FCNN segmentation. Legend: convolution-pooling layers; fully connected layers; "fusion" convolutional layers implemented by 1 x 1 kernel]


Saliency detection 


¢ Incorporate semantic information into 
saliency detection 


Image 





R. Zhao, W. Ouyang, H. Li, and X. Wang, “Saliency Detection by Multi-Context Deep Learning,” 
CVPR 2015 


Saliency detection 


¢ Global and local context 


Local Context Global Context 





Saliency detection 


¢ Multi-context modeling 


Global—context Modeling 
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F-measure Score 


Saliency detection 


Multi-context modeling 
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Saliency detection 


¢ Different network structures 


F-measure Score 
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e F-measure scores of benchmarking approaches on 
five public datasets 


Method       ASD      SED1          SED2     ECSSD    PASCAL-S
IS [20]      0.5943   0.5540        0.5682   0.4731   0.4901
GBVS [17]    0.6499   0.7125        0.5862   0.5528   0.5929
SF [44]      0.8879   [illegible]   0.7961   0.5448   0.5740
GC [12]      0.8811   0.8066        0.7728   0.5821   0.6184
CEOS [40]    0.9020   0.7935        0.6198   0.6465   0.6557
PCAS [41]    0.8613   0.7586        0.7791   0.5800   0.6332
GBMR [57]    0.9100   0.9062        0.7974   0.6570   0.7055
HS [56]      0.9307   0.8744        0.8150   0.6391   0.6819
DRFI [25]    0.9448   0.9018        0.8725   0.6909   0.7447
Ours         0.9548   0.9295        0.8903   0.7322   0.7930


Summary 


Deep learning significantly outperforms conventional vision 
systems on large scale image classification 


Feature representation learned from ImageNet can be well 
generalized to other tasks and datasets 


In face recognition, identity preserving features can be 
effectively learned by joint identification-verification signals 


3D face models can be learned from 2D images; identity and 
pose information is encoded by different sets of neurons 


In segmentation, larger patches lead to better performance 
because of the large learning capacity of deep models. It is 
also possible to directly predict the segmentation map. 


The efficiency of CNN based segmentation can be significantly 
improved by considering the differences between whole- 
image classification and pixelwise classification 


References 


A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep
Convolutional Neural Networks,” Proc. NIPS, 2012.

G. B. Huang, H. Lee, and E. Learned-Miller, “Learning Hierarchical Representation 
for Face Verification with Convolutional Deep Belief Networks,” Proc. CVPR, 2012. 
Y. Sun, X. Wang, and X. Tang, “Hybrid Deep Learning for Computing Face 
Similarities,” Proc. ICCV, 2013. 

Y. Sun, X. Wang, and X. Tang, “Deep Learning Face Representation from Predicting 
10,000 classes,” Proc. CVPR, 2014. 

Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “DeepFace: Closing the Gap to 
Human-Level Performance in Face Verification,” Proc. CVPR, 2014. 

Y. Sun, X. Wang, and X. Tang, “Deep Learning Face Representation by Joint 
Identification-Verification,” NIPS, 2014. 

A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “CNN Features off-the-shelf: 
an Astounding Baseline for Recognition,” arXiv preprint arXiv:1403.6382, 2014. 

Y. Gong, L. Wang, R. Guo, and S. Lazebnik, “Multi-Scale Orderless Pooling of Deep 
Convolutional Activation Features,” arXiv preprint arXiv:1403.1840, 2014. 


M. Turk and A. Pentland, “Eigenfaces for Recognition,” Journal of Cognitive 
Neuroscience, Vol. 3, pp. 71-86, 1991. 

P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, “Eigenfaces vs. Fisherfaces:
Recognition Using Class Specific Linear Projection,” TPAMI, Vol. 19, pp. 711-720,
1997.

B. Moghaddam, T. Jebara, and A. Pentland, “Bayesian Face Recognition,” Pattern
Recognition, Vol. 33, pp. 1771-1782, 2000.

X. Wang and X. Tang, “A Unified Framework for Subspace Face Recognition,” TPAMI,
Vol. 26, pp. 1222-1228, 2004.

Z. Zhu, P. Luo, X. Wang, and X. Tang, “Deep Learning and Disentangling Face
Representation by Multi-View Perception,” NIPS 2014.

C. Farabet, C. Couprie, L. Najman, and Y. LeCun, “Learning Hierarchical Features for 
Scene Labeling”, TPAMI, Vol. 35, pp. 1915-1929, 2013. 

P.O. Pinheiro and R. Collobert, “Recurrent Convolutional Neural Networks for 
Scene Labeling”, Proc. ICML 2014. 

R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich Feature Hierarchies for 
Accurate Object Detection and Semantic Segmentation” CVPR 2014 

P. Luo, X. Wang, and X. Tang, “Pedestrian Parsing via Deep Decompositional 
Network,” ICCV 2013. 

W. A. Freiwald and D. Y. Tsao, “Functional Compartmentalization and
Viewpoint Generalization within the Macaque Face-Processing System,” Science,
Vol. 330, pp. 845-851, 2010.

S. Ohayon, W. A. Freiwald, and D. Y. Tsao, “What Makes a Cell Face
Selective? The Importance of Contrast,” Neuron, Vol. 74, pp. 567-581, 2012.


Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, 
selective, and robust. CVPR, 2015. 


Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep Learning Face Attributes in the Wild,” 
arXiv:1411.7766, 2014. 


H. Li, R. Zhao, and X. Wang, “Highly Efficient Forward and Backward Propagation of 
Convolutional Neural Networks for Pixelwise Classification,” arXiv:1412.4526, 2014. 


K. Kang and X. Wang, “Fully Convolutional Neural Networks for Crowd
Segmentation,” arXiv:1411.4464, 2014.


Outline 


Introduction to deep learning 

Deep learning for object recognition 
Deep learning for object segmentation 
Deep learning for object detection 
Deep learning for object tracking 
Open questions and future works 


Part IV: Deep Learning for Object 
Detection 


¢ Pedestrian Detection 
¢ Human part localization 
¢ General object detection 






Deep learning 





Human pose estimation 


Pedestrian detection 


Deep Learning for Object Detection 


Jointly optimize the detection pipeline 


Multi-stage deep learning (cascaded detectors) 


Mixture components 


Integrate segmentation and detection to
suppress background clutter


Contextual modeling 


Pre-training 


Model deformation of object parts, which are 
shared across classes 


Joint Deep Learning: 


<> Jointly optimize the detection pipeline


What if we treat an existing deep model as 
a black box in pedestrian detection ? 


convolutions subsampling convolutions full 
connection 
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ConvNet-U-MS 


P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun, “Pedestrian Detection with
Unsupervised Multi-Stage Feature Learning,” CVPR 2013.
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Our Joint Deep Learning Model 


Visibility 
reasoning and 
classification 


Convolutional Average Convolutional Deformation 
layer 1 pooling layer 2 layer 
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W. Ouyang and X. Wang, “Joint Deep Learning for Pedestrian Detection,” Proc. ICCV, 2013. 


Modeling Part Detectors 


¢ Design the filters in the second 
convolutional layer with variable sizes 
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Results on Caltech Test 
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Pedestrian Detection aided by Deep 
Learning Semantic Tasks 


<> Improve feature learning with extra semantic tasks


Y. Tian, P. Luo, X. Wang, and X. Tang, "Pedestrian Detection aided by Deep Learning Semantic Tasks," CVPR 2015 


Pedestrian Detection aided by Deep
Learning Semantic Tasks
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Y. Tian, P. Luo, X. Wang, and X. Tang, “Pedestrian Detection aided by Deep Learning 
Semantic Tasks,” CVPR 2015 
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Pedestrian Detection on Caltech 
(average miss detection rates)


HOG+SVM: 63%
DPM: 63%
Joint DL: 39%
DL aided by semantic tasks: 17%


W. Ouyang and X. Wang, “Joint Deep Learning for Pedestrian Detection,” ICCV 2013. 


Y. Tian, P. Luo, X. Wang, and X. Tang, “Pedestrian Detection aided by Deep Learning 
Semantic Tasks,” CVPR 2015. 
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Multi-Stage Contextual Deep Learning: 


<> Train different detectors for different types of samples 
<> Model contextual information 
<> Stage-by-stage pretraining strategies 





X. Zeng, W. Ouyang and X. Wang, "Multi-Stage Contextual Deep Learning for Pedestrian Detection," ICCV 2013 


Motivated by Cascaded Classifiers and 
Contextual Boost 


¢ The classifier of each stage deals with a specific set 
of samples 


¢ The score map output by one classifier can serve as 
contextual information for the next classifier 





“* Only pass one detection 
score to the next stage 

“* Classifiers are trained 
sequentially 





Conventional cascaded classifiers for detection 


¢ Simulate the cascaded classifiers by mining hard samples to train the network
stage-by-stage

¢ Cascaded classifiers are jointly optimized instead of being trained sequentially 

¢ The deep model keeps the score map output by the current classifier and it 
serves as contextual information to support the decision at the next stage 

¢ To avoid overfitting, a stage-wise pre-training scheme is proposed to regularize 
optimization 
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Training Strategies 


Unsupervised pre-train W, ,,, layer-by-layer, setting W, ,,, = 0, F;,, =0 


Fine-tune all the W, ;,, with supervised BP 
Train F,,, and W, ;,, with BP stage-by-stage 


A correctly classified sample at the previous stage does not influence the
update of parameters


Stage-by-stage training can be considered as adding regularization
constraints to parameters, i.e. some parameters are constrained to be
zero in the early training stages
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Log error function: 


$E = -l \log y - (1 - l) \log(1 - y)$
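
As a small numerical illustration of the log error, the sketch below evaluates E and its derivative with respect to the prediction y. It assumes that l is a binary label in {0, 1} and y is a predicted probability in (0, 1); the function names are hypothetical, and the derivative shown is only with respect to y, not the network parameters from the slide.

```python
import math


def log_error(label, y):
    """E = -l log y - (1 - l) log(1 - y) for label l in {0, 1} and 0 < y < 1."""
    return -label * math.log(y) - (1 - label) * math.log(1 - y)


def log_error_grad(label, y):
    """Derivative of E with respect to y: dE/dy = -l / y + (1 - l) / (1 - y)."""
    return -label / y + (1 - label) / (1 - y)


# A confident correct prediction gives a small error, a confident wrong one a large error.
error_good = log_error(1, 0.9)
error_bad = log_error(1, 0.1)
```

In back propagation this derivative is multiplied by the derivative of y with respect to each parameter to obtain the update direction.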


Gradients for updating parameters: 
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Comparison of Different Training Strategies 
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Network-BP: use back propagation to update all the parameters without pre-training 
PretrainTransferMatrix-BP: the transfer matrices are pretrained without supervision, and then
all the parameters are fine-tuned

Multi-stage: our multi-stage training strategy 





Switchable Deep Network 


<> Use mixture components to model complex variations of
body parts


<> Use saliency maps to suppress background clutter


<> Help detection with segmentation information


P. Luo, Y. Tian, X. Wang, and X. Tang, “Switchable Deep Network for Pedestrian Detection", CVPR 2014 


Switchable Deep Network for 
Pedestrian Detection 





Background clutter and large variations of pedestrian 
appearance. 

Proposed Solution. A Switchable Deep Network (SDN)
for learning the foreground map and removing the effect of
background clutter.


Switchable Deep Network for 
Pedestrian Detection 


¢ Switchable Restricted Boltzmann Machine 
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Switchable Deep Network for 
Pedestrian Detection 


¢ Switchable Restricted Boltzmann Machine 
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Switchable Deep Network for 
Pedestrian Detection 
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(bo) Performance on ETH 


Human Part Localization 


<> Contextual information is important to segmentation as 
well as detection 


Human part localization 


¢ Facial Keypoint Detection 
¢ Human pose estimation 





Facial Keypoint Detection 


e Y. Sun, X. Wang and X. Tang, “Deep Convolutional Network 
Cascade for Facial Point Detection,’ CVPR 2013 
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Comparison with Liang et al. [6], Valstar et al. [7], Laxand Face SDK [1] and Microsoft 
Research Face SDK [2] on BioID and LFPW. 
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http://www.luxand.com/facesdk/ 

http://research.microsoft.com/en-us/projects/facesdk/. 

O. Jesorsky, K. J. Kirchberg, and R. Frischholz. Robust face detection using the hausdorff distance. In Proc. AVBPA, 2001. 
P. N. Belhumeur, D. W. Jacobs, D. J. Kriegman, and N. Kumar. Localizing parts of faces using a consensus of exemplars. In Proc. CVPR, 2011. 
X. Cao, Y. Wei, F. Wen, and J. Sun. Face alignment by explicit shape regression. In Proc. CVPR, 2012. 

L. Liang, R. Xiao, F. Wen, and J. Sun. Face alignment via component-based discriminative search. In Proc. ECCV, 2008. 

M. Valstar, B. Martinez, X. Binefa, and M. Pantic. Facial point detection using boosted regression and graph models. In Proc. CVPR, 2010. 


Validation. 





Benefits of Using Deep Model 


¢ The first network that takes the whole face as input needs
deep structures to extract high-level features


¢ Take the full face as input to make full use of texture context
information over the entire face to locate each keypoint
¢ Since the networks are trained to predict all the keypoints
simultaneously, the geometric constraints among keypoints
are implicitly encoded


Human pose estimation 


¢ W. Ouyang, X. Chu and X. Wang, “Multi-source Deep 
Learning for Human Pose Estimation” CVPR 2014. 





Multiple information sources 


¢ Appearance 





Multiple information sources 


¢ Appearance 
¢ Appearance mixture type 








Multiple information sources 


¢ Appearance 
¢ Appearance mixture type 
¢ Deformation 




















Multi-source deep model 





Experimental results 





Method                        Torso        U.leg  L.leg  U.arm  L.arm  Head  Total


[dataset name missing in source]
Yang&Ramanan CVPR'11          82.9         68.8   60.5   63.4   42.4   82.4  63.6
Multi-source deep learning    [illegible]  78.0   72.0   67.8   47.8   89.3  71.0


[dataset name illegible in source]
Yang&Ramanan CVPR'11          81.8         65.0   55.1   46.8   37.7   79.8  57.0
Multi-source deep learning    89.1         72.9   62.4   56.3   47.6   89.1  65.6


[dataset name illegible in source]
Yang&Ramanan CVPR'11          82.9         70.3   67.0   56.0   39.8   79.3  62.8
Multi-source deep learning    85.8         76.5   72.2   63.3   46.6   83.1  68.6


Up to 8.6 percent accuracy improvement with global geometric constraints 


Experimental results 





Left: mixture-of-parts (Yang&Ramanan CVPR’11)
Right: Multi-source deep learning 


General Object Detection 


<> Pretraining
<> Model deformation of object parts, which are shared across
classes
<> Contextual modeling


ImageNet Object Detection Task 
(2013) 


¢ 200 object classes
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Challenges -- person 


>» Intra-class variation 


>» Part existence 







Challenges 


Intra-class variation 


Part existence 


Color 


Occlusion 


Deformation 
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Mean Average Precision (mAP) 


DeepID-Net      50.3%
GoogLeNet       43.9%
RCNN            31.4%
UvA-Euvision    22.581%
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W. Ouyang and X. Wang, et al. “DeepID-Net: Deformable Deep Convolutional Neural 
Networks for Object Detection,” CVPR 2015


PASCAL VOC (SIFT, HOG, DPM...) 
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PASCAL VOC challenge dataset 


4 Post- 
competition 
results (2013 - 
present) 


° Top 
competition 
results (2007 - 
2012) 
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PSCAL VOC (CNN features] 


R-CNN 
58.5% 


VOC’07 


R-CNN 
53.3% 
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compeiition 
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© Top 
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voc’os voc’09 Vvoc’10 Vvoc’ll VOC’12 
PASCAL VOC challenge dataset 
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A Our current result 
73.9% 


A DeepID-Net 64.1% 


PASCAL VOC (CNN features)
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compeiition 
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PASCAL VOC challenge dataset 
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Object Detection on ImageNet 


RCNN (mean average precision: 31.4%) 
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Image Proposed Detection ‘Refined 


bounding boxes results bounding boxes 


DeepID-Net (mean average precision: 50.3%) 





DeepID-Net 
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Consideration for deep learning based 
general object detection 
¢ Time 
— Test 
— Training 
e Accuracy 


— Learning discriminative and invariant features 





— Capture complex deformation and 
parts 


— Rich contextual information 


¢ mAP from 31 to 50.3


Our pipeline 






Selective 


Bounding boxes Remaining det- C Ing 
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Object detection — old framework 


¢ Sliding window 
e Feature extraction 
¢ Classification 





For each window size 
For each window 
1. Feature extraction 
2. Classification 
End; 
End; 
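
The nested loop above can be sketched in Python. This is a toy illustration only: the image is a list of lists, and the `extract` and `classify` callables (here a mean-intensity feature and a threshold test) are hypothetical stand-ins for a real feature extractor and classifier.

```python
def sliding_windows(img_w, img_h, window_sizes, stride):
    """Yield (x, y, w, h) for every window position of every window size."""
    for w, h in window_sizes:
        for y in range(0, img_h - h + 1, stride):
            for x in range(0, img_w - w + 1, stride):
                yield x, y, w, h


def detect(image, window_sizes, stride, extract, classify):
    """For each window: 1. feature extraction, 2. classification."""
    img_h, img_w = len(image), len(image[0])
    hits = []
    for x, y, w, h in sliding_windows(img_w, img_h, window_sizes, stride):
        crop = [row[x:x + w] for row in image[y:y + h]]
        features = extract(crop)
        if classify(features):
            hits.append((x, y, w, h))
    return hits


def mean_intensity(crop):
    """Toy feature: average pixel value inside the window."""
    return sum(sum(row) for row in crop) / (len(crop) * len(crop[0]))


def is_object(feature):
    """Toy classifier: the window counts as an object only if it is fully bright."""
    return feature == 1.0


# 4 x 6 toy image with a bright 2 x 2 blob at columns 2-3, rows 1-2.
image = [[0] * 6 for _ in range(4)]
for r in (1, 2):
    for c in (2, 3):
        image[r][c] = 1

hits = detect(image, [(2, 2)], 1, mean_intensity, is_object)
window_count = len(list(sliding_windows(6, 4, [(2, 2)], 1)))
```

Even this tiny image produces 15 windows for a single window size, which hints at why exhaustive scanning over scales and aspect ratios becomes expensive on real images.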




Object detection — the framework


¢ Sliding window
¢ Feature extraction
¢ Classification


For each window size
    For each window
        1. Feature extraction    (feature vector: x = [x1 x2 x3 x4 ...])
        2. Classification        (object or not?)
    End;
End;


Problem of sliding windows 


Single-scale detection: 10k to 100k windows per image
Multi-scale detection: 100k to 1M windows per image
Multiple aspect ratios: 10M to 100M windows per image


Selective search: 2k windows per image of multiple scales and 
aspect ratios 


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Selective search 





Image Bounding icin 
¢ Initial segments from over-segmentation 
[Felzenszwalb2004] 
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Selective 





Selective search: image → bounding boxes
¢ Initial segments from over-segmentation 
[Felzenszwalb2004] 


¢ Based on hierarchical grouping 
¢ Group adjacent regions on region-level similarity 
¢ Consider all scales of the hierarchy 
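
The grouping steps above can be sketched as a greedy merge loop. This is a heavily simplified toy, not the published selective search: regions are represented only by their bounding boxes, adjacency means the boxes touch or overlap, and the similarity is a single hypothetical "small joint box" score, whereas the real method combines several region-level similarities. All function names are illustrative.

```python
def merge_boxes(a, b):
    """Smallest box (x0, y0, x1, y1) that contains both boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def adjacent(a, b):
    """True if the two boxes touch or overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])


def similarity(a, b, image_area):
    """Toy similarity: prefer merges whose joint box is small relative to the image."""
    return 1.0 - area(merge_boxes(a, b)) / image_area


def hierarchical_grouping(initial_boxes, image_area):
    """Greedily merge the most similar adjacent pair; keep boxes from every level."""
    regions = list(initial_boxes)
    proposals = list(initial_boxes)
    while len(regions) > 1:
        pairs = [(similarity(a, b, image_area), i, j)
                 for i, a in enumerate(regions)
                 for j, b in enumerate(regions)
                 if i < j and adjacent(a, b)]
        if not pairs:
            break  # no adjacent regions left to merge
        _, i, j = max(pairs)
        merged = merge_boxes(regions[i], regions[j])
        regions = [r for k, r in enumerate(regions) if k not in (i, j)] + [merged]
        proposals.append(merged)
    return proposals


# Three initial segments covering a 4 x 4 image.
initial = [(0, 0, 2, 2), (2, 0, 4, 2), (0, 2, 4, 4)]
proposals = hierarchical_grouping(initial, image_area=16)
```

The returned list contains the initial segments plus one box per merge, so boxes from all levels of the hierarchy end up as candidate windows.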
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Our investigation 


Speed-up the pipeline 
Effectively learn the deep model 


Make use of domain knowledge from 
computer vision 

— Deformation pooling 

— Context modelling 


mAP from 31 to 50.57 on val2


DeepID-Net


Box 





Proposed Remaining 
bounding boxes bounding boxes 
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W. Ouyang and X. Wang et al. “Deep|ID-Net: deformable deep convolutional neural 
networks for object detection”, CVPR, 2015 

















Bounding box rejection


¢ Motivation:
— Selective search: ~2400 bounding boxes per image
— Feature extraction using AlexNet
¢ ILSVRC val: ~20,000 images, ~2.4 days
¢ ILSVRC test: ~40,000 images, ~4.7 days


¢ Bounding box rejection by RCNN:
— For each box, RCNN has 200 scores S_1, ..., S_200 for 200 classes
— If max(S_1, ..., S_200) < -1.1, reject the box. About 6% of the bounding boxes remain.
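
The rejection rule can be written in a few lines. The sketch below assumes the per-box class scores are already available as lists of floats; the function name and the toy scores are hypothetical, and only the -1.1 threshold comes from the slide.

```python
def reject_boxes(box_scores, threshold=-1.1):
    """Return indices of boxes to keep: a box is rejected if its best class score is below the threshold."""
    return [i for i, scores in enumerate(box_scores) if max(scores) >= threshold]


# Three boxes with two class scores each (toy numbers, not real RCNN outputs).
toy_scores = [
    [-2.0, -1.5],   # best score -1.5 < -1.1, rejected
    [-1.2, 0.3],    # best score 0.3, kept
    [-1.1, -3.0],   # best score -1.1 is not below the threshold, kept
]
kept = reject_boxes(toy_scores)
```

Only the kept boxes need to go through the more expensive feature extraction of the later stages.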


Recall (val, ) 92.2% 89.0% 84.4% 


Feature extraction time (seconds per image) 10.24 2.88 1.18 
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¢ Speed up the pipeline 
— Save the feature extraction time by about 10 times. 


¢ Improve mean AP by 1% 


[Figure: bar chart of running time in hours (axis from 0 to 100), with and without bbox rejection, for finetuning, feature extraction (val1), feature extraction (val2), testing SVM score, and all steps]
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Deep|D-Net 


DeepID-Net
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Deep learning is feature learning 






Sa ies tT a 


Image classification Tracking 


Segmentation 


Features learned on ImageNet 


Learning features and classifiers separately 


¢ How to effectively learn features ? 
— With challenging tasks 
— Predict high-dimensional vectors 


Training ; Training 


> 


Deep 
learning 





Classifier 1 Classifier 2 


Prediction Prediction 
on task 1 on task 2 oe 


Classifier B 


Prediction on task B 
(Our target task) 







Directly training 200 binary classifiers with CNNs does not work well


Pre-train on classifying 1,000 categories → fine-tune on classifying 201 categories


Detect 200 object classes on ImageNet


Girshick, Ross, et al. CVPR, 2014 





Feature 
representation 


SVM binary 
classifier for each 
category 





Why is pre-training with many classes needed?


¢ Each sample carries much more information


¢ One big negative class with many types of
objects confuses CNN on feature learning


¢ Make the training task challenging, not easy to
overfit


Feature learning 


¢ Pretrain for image-classification with 1000 classes 
¢ Finetune for object-detection with 200+1 classes 


— Transfer the representation learned from ILSVRC 
Classification to PASCAL (or ImageNet) detection 


¢ Use the fine-tuned features for learning SVM 





Girshick, Ross, et al. CVPR, 2014 


Feature learning 


¢ Pretrain for image-classification with 1000 classes 
¢ Finetune for object-detection with 200+1 classes 
¢ Use the fine-tuned features for learning SVM 


¢ Existing approaches mainly investigate network
structure


¢ Number of layers/channels, filter size, dropout





Girshick, Ross, et al. CVPR, 2014 


Deep model design 


¢ Network structure 





pooling 





pooling 





Annotation level Image Image 


Bbox rejection n y 
MAP (%) 29.9 30.9 
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Deep model design 


¢ Network structure 





pooling 






pooling 





Net structure       AlexNet   AlexNet   Clarifai
Annotation level    Image     Image     Image
Bbox rejection      n         y         y
mAP (%)             29.9      30.9      31.8
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Deep model design 


¢ Network structure 
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Bbox rejection n y y y 
mAP (%) 29.9 30.9 31.8 36.6


Deep model design 


¢ Network structure 


256 480 








Net structure       AlexNet   AlexNet   Clarifai   Overfeat   GoogLeNet
Annotation level    Image     Image     Image      Image      Image
Bbox rejection      n         y         y          y          y
mAP (%)             29.9      30.9      31.8       36.6       37.8
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Feature learning — pretrain 


¢ Classification 
— Pretrain for image-classification with 1000 classes 
— Finetune for object detection with 200 classes 
— Gap: classification vs. detection, 1000 vs. 200 





Image classification Object detection 


Feature learning — pretrain 


¢ Classification 
— Pretrain for image-classification with 1000 classes 
— Finetune for object detection with 200 classes 
— Gap: classification vs. detection, 1000 vs. 200 





Pie 


Image classification Object detection 


Feature learning — pretrain 


¢ Classification 





Pretrained on object-level annotation Pretrained on image-level annotation
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Feature learning — pretrain 


¢ Classification (Cls) 
— Pretrain for image-classification with 1000 classes 
— Gap: classification vs. detection, 1000 vs. 200 

¢ Detection (Loc) 


— Pretrain for object-detection with 1000 classes 


Net structure AlexNet Clarifai Clarifai 





MAP (%) on val2 29.9 31.8 36.0 


Result and discussion 


¢ RCNN (Cls+Det), 
¢ Our investigation 


¢ Better pretraining on 1000 classes 
¢ Object-level annotation is more suitable for pretraining 


AlexNet                   Image annotation   Object annotation
200 classes (Det)         20.7               32
1000 classes (Cls-Loc)    31.8               36
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MAP 31 mam) to 50.57 on val2 


DeepID-Net


[Figure: DeepID-Net pipeline with proposed bounding boxes, remaining bounding boxes, and hinge loss; example detection labeled "person"]




Feature learning — SVM-net 


e Existing approach 
— Learn features using soft-max loss (Softmax-Net) 
— Train SVM with the learned features 
¢ Replace Soft-max loss by Hinge loss when fine-tuning (SVM-Net) 
— Merge the two steps of RCNN into one 
— Require no feature extraction from training data (~60 hours) 
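
To make the difference between the two losses concrete, the sketch below computes both for one sample. The slide does not give the exact hinge formulation, so a one-vs-rest hinge loss with margin 1 is assumed here because it mirrors training one binary SVM per class; the function names and the example scores are hypothetical.

```python
import math


def softmax_loss(scores, target):
    """Negative log of the softmax probability of the target class."""
    m = max(scores)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[target]


def hinge_loss(scores, target, margin=1.0):
    """One-vs-rest hinge loss: each class is a binary problem with label +1 or -1 (assumed form)."""
    loss = 0.0
    for c, s in enumerate(scores):
        sign = 1.0 if c == target else -1.0
        loss += max(0.0, margin - sign * s)
    return loss


scores = [2.0, -3.0, 0.5]
soft = softmax_loss(scores, target=0)
hinge = hinge_loss(scores, target=0)
```

With the hinge loss, classes that are already on the correct side of the margin contribute nothing, which is the same behavior an SVM trained on the extracted features would show.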


Estimated result 
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Our pipeline 


DeepID-Net
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Proposed def-pooling 
bounding boxes bounding boxes layer 
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Deep model training — def-pooling layer 


¢ RCNN (ImageNet Cls+Det) 


— Pretrain on image-level annotation with 1000 classes 
— Finetune on object-level annotation with 200 classes 
— Gap: classification vs. detection, 1000 vs. 200 

¢ DeepID-Net (ImageNet Loc+Det) 
— Pretrain on object-level annotation with 1000 classes 


— Finetune on object-level annotation with 200 classes 
with def-pooling layers 





Net structure Without Def Layer With Def layer 


MAP (%) on val2 36.0 38.5 
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Deformation 


— Learning deformation [a] has proven effective in the computer vision community.
— It is missing in deep models.


— We propose a new deformation constrained pooling layer.





[a] P. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. IEEE Trans. PAMI,
32:1627-1645, 2010.
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Modeling Part Detectors 


¢ Different parts have different sizes 
¢ Design the filters with variable sizes 
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Part models 

















Part models learned 


from HOG 
Head-torso Head-shoulder Legs 
at level 3 at level 2 at level 2 





Full-body Torso 


Head-shoulder 


at level 3 at level 3 at level 2 


Learned filters at the second
convolutional layer


Deformation Layer [b]
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[b] Wanli Ouyang, Xiaogang Wang, "Joint Deep Learning for Pedestrian Detection ", ICCV 2013. = 303 


Deformation layer for repeated 
patterns 





Pedestrian detection          General object detection


Assume no repeated pattern Repeated patterns 
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Deformation layer for repeated 


patterns 


Assume no repeated pattern Repeated patterns 





Only consider one object class Patterns shared across different object classes 











Deformation constrained pooling layer 


Can capture multiple patterns simultaneously 






[Figure: input → filter → convolution result M → deformation penalty → max pooling → output B]
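
A minimal sketch of the idea behind deformation constrained pooling: the output is the maximum of the convolution result minus a penalty that grows with the displacement from an anchor position. The quadratic penalty, its coefficients `a`, and the function name are assumptions made for illustration; in the proposed layer the deformation parameters are learned.

```python
def def_pooling(m, anchor, a=(1.0, 1.0)):
    """Return (best score, best position) of m[i][j] minus a quadratic displacement penalty.

    penalty = a[0] * (i - anchor[0])**2 + a[1] * (j - anchor[1])**2  (assumed form)
    """
    best, best_pos = None, None
    for i, row in enumerate(m):
        for j, value in enumerate(row):
            penalty = a[0] * (i - anchor[0]) ** 2 + a[1] * (j - anchor[1]) ** 2
            score = value - penalty
            if best is None or score > best:
                best, best_pos = score, (i, j)
    return best, best_pos


# Toy 3 x 3 convolution result with a strong response away from the anchor.
m = [[0, 0, 6],
     [0, 3, 0],
     [0, 0, 0]]

mild = def_pooling(m, anchor=(1, 1))                  # small penalty: the far response wins
strict = def_pooling(m, anchor=(1, 1), a=(3.0, 3.0))  # large penalty: the anchor wins
plain = def_pooling(m, anchor=(1, 1), a=(0.0, 0.0))   # zero penalty: ordinary max pooling
```

Setting the penalty to zero recovers ordinary max pooling, so the layer can be seen as max pooling with a learned preference for where a part should appear.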


Our deep model with deformation 
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128 128 
Net structure      AlexNet   Clarifai   Clarifai+Def layer


Mean AP on val2    0.299     0.360      0.385


mAP from 31 to 50.57 on val2


DeepID-Net
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Context modeling 
¢ Use the 1000-class image classification score.
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Context modeling 


¢ Use the 1000-class image classification score.
— ~1% mAP improvement.
— Volleyball: improves AP by 8.4% on val2.
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MAP 31 mam) to 50.57 on val2 


Deep|ID-Net 
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Model averaging 


¢ Models of different structures are 
complementary on different classes. 
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Annotation Image Object Object 
level
Bbox rejection     n      n      n
mAP (%)            29.9   34.3   35.6


[Figure: per-class comparison of the models (example class: hamster)]


mAP from 31 to 50.57 on val2


DeepID-Net










Selective . Box Deep!ID-Net 
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Compa rison with state-of-the-art 


Approach             Flair   RCNN   Berkeley Vision   UvA-Euvision   DeepInsight   GoogLeNet   Ours

mAP on val2 (avg)    n/a     n/a    n/a               [missing]      [missing]     44.5        50.7
mAP on val2 (sgl)    n/a     31.0   33.4              n/a            40.1          38.8        48.2
mAP on test (avg)    22.6    n/a    n/a               n/a            40.5          43.9        50.3
mAP on test (sgl)    n/a     31.4   34.5              35.4           40.2          38.0        47.9


Selective 





DeepID-Net
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q om ——SCC#éP retrain, 
Proposed Remaining 2¢!-Pooling 
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Component analysis 


Detection pipeline (column headers are only partly legible in the source: ..., +bbox, +Edge, +Def, Scale, ..., +bbox, Model)

mAP on val2    29.9  30.9  36.6  37.8  40.4  42.7  44.9  47.3  47.8  48.2  50.7
mAP on test                                                        47.9  50.3


[Figure: mAP gain (%) of each component]
bbox pretrain      2.6
Edgebox            2.3
Def layer          2.2
Scale jittering    2.4
Context            0.5
bbox regr.         0.4
Model avg.         2.5
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Conclusions 


¢ Jointly optimize vision components (joint deep
learning)


¢ Propose new layers based on domain knowledge (def-pooling
layer)

¢ Carefully design the strategies of learning feature
representations

— Feature learning aided by semantic tasks

— Pre-training with challenging tasks and rich predictions


— The chosen training tasks help to achieve the desired feature
invariance and discriminative power


— Adapted to specific tasks in test


Summary 


Speed up the pipeline:

— Bounding box rejection. Saves feature extraction time by about
10 times, slightly improves mAP (~1%).

— Hinge loss. Saves feature computation time (~60 h).
Improve the accuracy 


— Pre-training with object-level annotation, more 
classes. 2.6% mAP 


— Def-pooling layer. 2.5% mAP 
— Context. 0.5-1% mAP 


— Model averaging. Different model designs and training 
schemes lead to high diversity 
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Outline 


Introduction to deep learning 

Deep learning for object recognition 
Deep learning for object segmentation 
Deep learning for object detection 
Deep learning for object tracking 
Open questions and future works 


Motivations 


• Explore the features pre-trained on massive data with the ImageNet classification task

• A top convolution layer encodes more semantic features and serves as a category detector

• A lower convolution layer carries more discriminative information and can better separate the target from distractors with similar appearance

• Both layers are jointly used with a switch mechanism during tracking

• For a given tracking target, only a subset of neurons is relevant


L. Wang, W. Ouyang, X. Wang, and H. Lu, “Visual Tracking with Fully Convolutional Networks,” 
ICCV 2015. 


Observation 1: Different layers encode different types of features.
Higher layers capture semantic concepts of object categories,
whereas lower layers encode more discriminative features to
capture intra-class variations







(a) Ground truth target heat map; (b) Predicted heat maps using feature maps of 
top convolution layers of VGG; (c) Predicted heat maps using feature maps of lower 
convolution layers of VGG 


Observation 2: Although the receptive field of CNN feature maps is
large, activated feature maps are sparse and localized. Activated
regions are highly correlated with the regions of semantic objects
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Activation value histograms of feature maps in top (left) and lower (right) layers 


Observation 3: Many CNN feature maps are noisy or irrelevant to
the task of discriminating a particular target from its background





(a) Ground truth foreground mask; (b) average feature maps of convolution
layers; (c) average selected feature maps of convolution layers


Selection of feature maps 


• Select feature maps by reconstructing foreground masks and ranking the
maps by their significance, calculated with back propagation (BP)





The sparse coefficients are computed using the images in the first column and 
directly applied to the other columns without change 
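
The selection idea can be sketched in a simplified form. The code below is not the method on the slide: it swaps in a linear ridge regression that reconstructs the foreground mask from the feature maps, and scores each map by how much the reconstruction error grows when that map is dropped, instead of a BP-based significance. The function name `select_feature_maps`, the `ridge` parameter and the toy data are hypothetical.

```python
import numpy as np


def select_feature_maps(features, mask, keep, ridge=1e-3):
    """Rank feature maps by their contribution to reconstructing the mask.

    features: array (C, H, W); mask: array (H, W) with foreground = 1.
    Returns the indices of the `keep` most significant maps and all scores.
    """
    c = features.shape[0]
    X = features.reshape(c, -1).T        # (H*W, C), one column per map
    y = mask.reshape(-1).astype(float)   # (H*W,)
    # Ridge regression: weights that best reconstruct the mask.
    w = np.linalg.solve(X.T @ X + ridge * np.eye(c), X.T @ y)
    residual = X @ w - y
    base_error = residual @ residual
    significance = np.empty(c)
    for k in range(c):
        # Residual when map k is removed from the reconstruction.
        r = residual - X[:, k] * w[k]
        significance[k] = r @ r - base_error
    order = np.argsort(-significance)
    return order[:keep], significance


rng = np.random.default_rng(0)
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0
features = 0.1 * rng.standard_normal((4, 8, 8))
features[2] = mask  # one map that matches the foreground exactly
selected, significance = select_feature_maps(features, mask, keep=2)
```

In this toy setup the map that coincides with the foreground receives the highest score, while the noise maps contribute little.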


Fully convolutional network based 
tracker (FCN) 


GNet: captures the category information of the target and is built on
the top layers of VGG


SNet: discriminates the target from background with similar
appearance and is built on the lower layers of VGG






[Figure: Feature Map Selection]

(b) VGG network; (c) SNet; (d) GNet; (e) Tracking results


Both GNet and SNet are initialized in the first frame to perform
foreground heat map regression for the target. GNet is then fixed, while
SNet is updated every 200 frames


SNet is used if the background distractor is larger than a threshold;
otherwise GNet is used
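
A minimal sketch of the switch rule follows. It assumes the distractor is measured as the ratio of heat map response outside the target box to the response inside it; that measure, the threshold value 0.2, and the names `distractor_ratio` and `choose_network` are illustrative assumptions rather than details given on the slide.

```python
import numpy as np


def distractor_ratio(heat_map, box):
    """Ratio of heat map mass outside the target box to the mass inside.

    heat_map: non-negative array (H, W); box: (x0, y0, x1, y1) in pixels,
    with x1 and y1 exclusive.
    """
    x0, y0, x1, y1 = box
    inside = heat_map[y0:y1, x0:x1].sum()
    outside = heat_map.sum() - inside
    return outside / max(inside, 1e-12)  # guard against an empty target


def choose_network(gnet_heat_map, box, threshold=0.2):
    """Pick SNet when background distractors are strong, otherwise GNet."""
    if distractor_ratio(gnet_heat_map, box) > threshold:
        return "SNet"
    return "GNet"


clean = np.zeros((20, 20))
clean[5:10, 5:10] = 1.0            # response only on the target
cluttered = clean.copy()
cluttered[14:18, 14:18] = 1.0      # a second, distractor response
target_box = (5, 5, 10, 10)
choice_clean = choose_network(clean, target_box)
choice_cluttered = choose_network(cluttered, target_box)
```

With these toy heat maps the clean case stays with GNet, and the cluttered case switches to SNet.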


For a new frame, a region of interest (ROI) centered at the last 
target location containing both target and background context is 
cropped and propagated through the fully convolutional network 
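
The ROI cropping step might look like the sketch below. The context factor of 2.0, the edge-replication padding at image borders and the function name `crop_roi` are assumptions made for illustration; the slide only states that the ROI is centered at the last target location and includes background context.

```python
import numpy as np


def crop_roi(frame, center, target_size, context=2.0):
    """Crop a region centered at the last target location.

    frame: array (H, W) or (H, W, channels); center: (cx, cy) in pixels;
    target_size: (width, height) of the target; context: how many times
    larger than the target the ROI is (hypothetical parameter).
    """
    cx, cy = center
    tw, th = target_size
    rw, rh = int(round(tw * context)), int(round(th * context))
    x0 = int(round(cx - rw / 2))
    y0 = int(round(cy - rh / 2))
    h, w = frame.shape[:2]
    # Pad just enough so that the ROI never leaves the padded frame.
    pad = max(0, -x0, -y0, x0 + rw - w, y0 + rh - h)
    pad_width = [(pad, pad), (pad, pad)] + [(0, 0)] * (frame.ndim - 2)
    padded = np.pad(frame, pad_width, mode="edge")
    return padded[y0 + pad:y0 + pad + rh, x0 + pad:x0 + pad + rw]


frame = np.arange(100).reshape(10, 10)
roi_center = crop_roi(frame, center=(5, 5), target_size=(2, 2))
roi_corner = crop_roi(frame, center=(0, 0), target_size=(2, 2))
```

The cropped array would then be fed to the fully convolutional network; padding keeps the ROI size fixed when the target is near the image border.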


Precision plots and success plots of one-pass evaluation (OPE) for
the top 10 trackers
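
For readers unfamiliar with these plots, the sketch below computes one point on each curve: precision as the fraction of frames whose center location error is within a pixel threshold, and success as the fraction of frames whose bounding box overlap exceeds an overlap threshold. The box format (x, y, width, height), the default thresholds and the function names are assumptions of this sketch.

```python
import numpy as np


def center_error(pred, gt):
    """Euclidean distance between box centers; boxes are (x, y, w, h) rows."""
    pred_center = pred[:, :2] + pred[:, 2:] / 2
    gt_center = gt[:, :2] + gt[:, 2:] / 2
    return np.linalg.norm(pred_center - gt_center, axis=1)


def overlap(pred, gt):
    """Intersection over union for each pair of boxes."""
    x0 = np.maximum(pred[:, 0], gt[:, 0])
    y0 = np.maximum(pred[:, 1], gt[:, 1])
    x1 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y1 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x1 - x0, 0, None) * np.clip(y1 - y0, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / union


def precision_at(pred, gt, pixel_threshold=20.0):
    """Fraction of frames with center error within the threshold."""
    return float(np.mean(center_error(pred, gt) <= pixel_threshold))


def success_at(pred, gt, overlap_threshold=0.5):
    """Fraction of frames with overlap above the threshold."""
    return float(np.mean(overlap(pred, gt) > overlap_threshold))


gt_boxes = np.array([[0.0, 0.0, 10.0, 10.0], [0.0, 0.0, 10.0, 10.0]])
pred_boxes = np.array([[0.0, 0.0, 10.0, 10.0], [50.0, 50.0, 10.0, 10.0]])
precision = precision_at(pred_boxes, gt_boxes)
success = success_at(pred_boxes, gt_boxes)
```

Sweeping the threshold over a range of values and plotting the resulting fractions gives the full precision and success curves.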
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Precision plots of OPE 
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Outline 


• Open questions and future works


“Concerns” on deep learning 


• C1: Weak on theoretical support (convergence, bound, local minimum, why it works)

– It’s true. That’s why deep learning papers were not accepted by the computer vision/image processing community for a long time. Any theoretical studies in the future are important.


Most computer vision/multimedia papers: new objective function, new optimization algorithm, theoretical analysis, experimental results


Deep learning papers for computer vision/multimedia: new network structure and new objective function, back propagation (standard), super experimental results


That’s probably one of the reasons why computer vision and image
processing people think deep learning papers lack novelty and
theoretical contribution


“Concerns” on deep learning


• C2: It is hard for computer vision/image processing people to make innovative contributions to deep learning. Our job becomes preparing the data + using deep learning as a black box. That’s the end of our research life.


– That’s not true. Computer vision and image processing researchers have developed many systems with deep architectures, but we just didn’t know how to jointly learn all the components. Our research experience and insights can help to design new deep models and pre-training strategies.


– Many machine learning models and algorithms were motivated by computer vision and image processing applications. However, computer vision and multimedia did not have close interaction with neural networks in the past 15 years. We expect fast development of deep learning driven by applications.


“Concerns” on deep learning


• C3: Since the goal of neural networks is to solve the general learning problem, why do we need domain knowledge?

– The most successful deep model for image- and video-related applications is the convolutional neural network, which uses domain knowledge (filtering, pooling)

– Domain knowledge is important, especially when the training data is not large enough


“Concerns” on deep learning


• C4: Good results achieved by deep learning come from manually tuning network structures and learning rates, and trying different initializations

– That’s not true. One round of evaluation may take several weeks. There is no time to test all the settings.

– Designing and training deep models does require a lot of empirical experience and insights. There are also a lot of tricks and guidance provided by deep learning researchers. Most of them make sense intuitively but lack strict proof.


“Concerns” on deep learning


• C5: Deep learning is more suitable for industry than for research groups in universities

– Industry has big data and computation resources

– Research groups from universities can contribute to model design, training algorithms and new applications


“Concerns” on deep learning


• C6: Deep learning behaves differently when the scale of training data is different

– Pre-training is useful when the training data is small, but does not make a big difference when the training data is large enough

– So far, the performance of deep learning keeps increasing with the size of training data. We don't see its limit yet.

– Shall we spend more effort on data annotation or model design?


“Concerns” on deep learning


• C7: Deep learning is neural networks, which are old

– Studying the behavior of neural networks under large-scale training is new


Future works 


• Explore deep learning in new applications

– Worth trying if the applications require features or learning, and have enough training data

– We once had many doubts about deep learning. (Does it work for vision? Does it work for segmentation? Does it work for low-level vision?) But deep learning has given us a lot of surprises.

– Applications will inspire many new deep models

• Incorporate domain knowledge into deep learning

• Integrate existing machine learning models with deep learning


Future works 


• Deep learning to extract dynamic features for video analysis

• Deep models for structured data

• Theoretical studies on deep learning

• Quantitative analysis on how to design network structures and how to choose nonlinear operations of different layers in order to achieve feature invariance

• New optimization and training algorithms

• Parallel computing systems and algorithms to train very large and deep networks with larger training data


Projects 


Multimedia Laboratory 
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Description Download 
Demo code that allows you to input a pedestrian image and then compute the label map. Zip
Reference:
1. P. Luo, X. Wang, and X. Tang, "Pedestrian Parsing via Deep Decompositional Neural Network," in Proceedings of IEEE International Conference on Computer Vision (ICCV) 2013 [PDF] [Project Page]

Demo code that shows you how the frontal-view face image of a query face image is reconstructed. Zip
Reference:
1. Z. Zhu, P. Luo, X. Wang, and X. Tang, "Deep Learning Identity Preserving Face Space," in Proceedings of IEEE International Conference on Computer Vision (ICCV) 2013 [PDF] [Project Page]

Matlab training and testing source code for pedestrian detection using the proposed approach. Models trained on INRIA and Caltech are provided. Webpage
Reference:
1. Wanli Ouyang, Xiaogang Wang, "Joint Deep Learning for Pedestrian Detection", in Proceedings of IEEE International Conference on Computer Vision (ICCV) 2013 [PDF] [Project Page]
2. Wanli Ouyang, Xiaogang Wang, "A Discriminative Deep Model for Pedestrian Detection with Occlusion Handling", in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2012 [PDF] [Project Page]

Executable files for the face detector and facial point detector. Webpage
Reference:
1. Y. Sun, X. Wang, and X. Tang, "Deep Convolutional Network Cascade for Facial Point Detection," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3476-3483, 2013 [PDF] [Project Page]





http://mmlab.ie.cuhk.edu.hk/project_ deep learning.html 
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